Foreword



Driven by the requirements of a large number of practical and commercially im- 
portant applications, the last decade has witnessed considerable advances in pat- 
tern recognition. Better understanding of the design issues and new paradigms, 
such as the Support Vector Machine, have contributed to the development of im- 
proved methods of pattern classification. However, while any performance gains 
are welcome, and often extremely significant from the practical point of view, it 
is increasingly more challenging to reach the point of perfection as defined by 
the theoretical optimality of decision making in a given decision framework. 

The asymptoticity of gains that can be made for a single classifier is a reflec- 
tion of the fact that any particular design, regardless of how good it is, simply 
provides just one estimate of the optimal decision rule. This observation has 
motivated the recent interest in Multiple Classifier Systems, which aim to make 
use of several designs jointly to obtain a better estimate of the optimal decision 
boundary and thus improve the system performance. This volume contains the 
proceedings of the international workshop on Multiple Classifier Systems held 
at Robinson College, Cambridge, United Kingdom (July 2-4, 2001), which was 
organized to provide a forum for researchers in this subject area to exchange 
views and report their latest results. 

Following its predecessor, Multiple Classifier Systems 2000 (Springer ISBN
3-540-67704-6), the particular aim of the MCS 2001 workshop was to bring together
researchers from the diverse communities with interests in multiple classifiers: 
Machine Learning, Pattern Recognition, Neural Networks, and Statistics. This 
aim has been successfully accomplished, with this volume presenting 44 papers 
from the 4 different communities. The collection has been organized into thematic
sessions dealing with bagging and boosting, MCS design methodology, ensemble
classifiers, feature spaces for MCS, applications of MCS, one-class MCS and
clustering, and, finally, combination strategies. It includes contributions from the
invited speakers: Tin Ho (Lucent Technologies, USA), Nathan Intrator (Tel-Aviv
University, Israel), and David Hand (Imperial College of Science and Technology,
London).

The workshop was sponsored by the University of Surrey, Guildford, United
Kingdom and the University of Cagliari, Italy, and was co-sponsored by the
International Association for Pattern Recognition through its Technical Committees
TC1: Statistical Pattern Recognition Techniques, and TC16: Algebraic
and Discrete Mathematical Techniques in Pattern Recognition and Image Analysis,
without whose support the workshop could not have taken place. Their
financial assistance is gratefully acknowledged.

We also wish to convey our gratitude to all those who helped to organize 
MCS 2001. First of all our thanks are due to the members of the Scientific 
Committee who selected the best papers from a large number of submissions to 
create an excellent technical content. Jon Benediktsson played a particularly important
role in this context in soliciting contributions for the special session on
remote sensing. Last but not least, special thanks are due to the members of
the Organizing Committee for their selfless effort to make MCS 2001 successful.
Notably, we would like to thank David Windridge for his contribution to the
production of this volume, Giorgio Giacinto and Giorgio Fumera for maintaining
the MCS 2001 website, and Terry Windeatt for compiling the workshop
program.



Josef Kittler and Fabio Roli 

April 2001 
Bagging and the Random Subspace Method for
Redundant Feature Spaces 

Marina Skurichina and Robert P.W. Duin

Pattern Recognition Group, Department of Applied Physics, Faculty of Applied Sciences,
Delft University of Technology, P.O. Box 5046, 2600GA Delft, The Netherlands
{marina, duin}@ph.tn.tudelft.nl



Abstract. The performance of a single weak classifier can be improved by using 
combining techniques such as bagging, boosting and the random subspace 
method. When applying them to linear discriminant analysis, it appears that they 
are useful in different situations. Their performance is strongly affected by the 
choice of the base classifier and the training sample size. Their
usefulness also depends on the data distribution. In this paper, using the
pseudo Fisher linear classifier as an example, we study the effect of redundancy in the data
feature set on the performance of the random subspace method and bagging.



1 Introduction 

In many applications of discriminant analysis, data often consist of a large number of 
measurements (features) with a relatively small number of observations (objects). In 
these circumstances, it may be difficult to construct a good single classification rule. 
Usually, a classifier constructed on a small training set is biased and has a large
variance. Consequently, such a classifier may be weak, having a poor performance [1].
One way to improve a weak classifier is to stabilize its decision, for instance, by
regularization [2] or noise injection [3]. Another popular approach is to use a
combined decision of many weak classifiers instead of a single classifier.
Examples of such combining techniques are bagging [4], boosting [5] and the Random
Subspace Method (RSM) [6], which were originally designed for decision trees.

When applying bagging and the RSM to Linear Discriminant Analysis (LDA), it 
was established that these techniques are useful in different situations. Their 
performance is strongly affected by the choice of the base classifier and the training 
sample size [7]. As well, it was noticed that their relative performance differs for 
different data sets. It was demonstrated that the problem complexity (feature 
efficiency, the length of class boundary etc.) affects the performance of combining 
techniques [8]. In particular, it was shown that the RSM performs relatively better 
when the discrimination power is distributed evenly over many features.

In this paper we study the effect of the redundancy in the feature set on the 
performance of bagging and the RSM. When data have many completely redundant 
noise features or many data features are highly correlated (many features contain the 
same information), the intrinsic (true) data dimensionality is smaller than the 
dimensionality of the feature space where data objects are described. The intrinsic data 
dimensionality is one of the important characteristics of the data set that may influence 
the performance of classifiers. Consequently, the redundancy in the data feature set 
(and the intrinsic data dimensionality) may affect the performance of the combining 
techniques, as their performance depends on the training sample size relative to the
data dimensionality. It is especially of interest for the RSM, where one constructs 
classifiers in random subspaces, because the number of informative features and the 
dimensionality of random subspaces should affect the performance of the RSM. 

In order to study the effect of redundancy in the data feature set on the performance 
of bagging and the RSM (which are discussed in section 2), we consider two cases of 
the redundancy representation in the data feature set. In the first case, data have many 
completely redundant noise features. In the second case, the useful information is 
spread over many features while the data themselves have a low intrinsic dimensionality. The
data sets used, each representing a two-class problem, are described in section 3. We perform
our simulation study on the example of the Pseudo Fisher Linear Discriminant (PFLD)
[9]. Using this classifier as a single classification rule is not recommended because it is
weak for critical training sample sizes (when the number of training objects is
comparable with the data dimensionality) (see Fig. 1). However, bagging and the RSM
are designed precisely for weak classifiers. So the PFLD is very suitable as a base
classifier in these combining techniques when it is constructed on critical training
sample sizes. Moreover, our previous study [7] has shown that bagging and the RSM
are useful for the PFLD. The results of our simulation study of the effect of the 
redundancy in the data feature set on the performance of bagging and the RSM are 
discussed in section 4. Conclusions are summarized in section 5. 
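
To make the base classifier concrete, the following is a minimal sketch of a pseudo Fisher linear discriminant, assuming the usual formulation in which the Moore-Penrose pseudo-inverse of the pooled sample covariance matrix replaces the ordinary inverse, and assuming equal class priors. The function names `fit_pfld` and `predict_pfld` are hypothetical and are not taken from the paper.

```python
import numpy as np

def fit_pfld(X, y):
    """Fit a pseudo Fisher linear discriminant for labels in {-1, +1}.

    Returns (w, w0) so that the discriminant is C(x) = x @ w + w0.
    Equal class priors are assumed (an assumption of this sketch).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    m_pos = X[y == 1].mean(axis=0)
    m_neg = X[y == -1].mean(axis=0)
    # Pooled within-class covariance estimate; singular when n < p.
    Xc = np.vstack([X[y == 1] - m_pos, X[y == -1] - m_neg])
    S = Xc.T @ Xc / len(X)
    # The pseudo-inverse keeps the rule defined even for a singular S.
    w = np.linalg.pinv(S) @ (m_pos - m_neg)
    w0 = -0.5 * (m_pos + m_neg) @ w
    return w, w0

def predict_pfld(w, w0, X):
    """Return class labels in {-1, +1}; ties go to +1 by convention here."""
    return np.where(np.asarray(X, dtype=float) @ w + w0 >= 0, 1, -1)
```

The pseudo-inverse is what allows the rule to be computed when the number of training objects is smaller than the data dimensionality, which is the regime studied in this paper.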

2 Bagging and the Random Subspace Method 

In order to improve the performance of weak classifiers, a number of combining 
techniques can be used. Bagging and the RSM are two of them. They both modify the 
training data set by sampling either training objects (in bagging) or data features (in 
the RSM), build classifiers on these modified training sets and then combine them into
a final decision. Usually the simple majority vote is used to get a final decision.
However, the weighted majority vote used in boosting [5] is preferable because it
is more resistant to overtraining (when increasing the number B of combined
classifiers) than other combining rules [10]. Therefore, we use the weighted majority
vote in both studied combining techniques.




Fig. 1. The shifting effect of the generalization error (GE) of the PFLD for the RSM and 
bagging. In the RSM, the GE shifts with respect to the GE of the original classifier in the 
direction of the GE obtained on larger training sets. In bagging, the GE shifts with respect to the 
GE of the original classifier in the direction of the GE obtained on smaller training sets. (Here, 
n is the number of training objects and p is the data dimensionality.) 
Bagging was proposed by Breiman [4] and is based on the bootstrapping [11] and
aggregating concepts, so it incorporates the benefits of both approaches.
Bootstrapping is based on random sampling with replacement. Therefore, taking a
bootstrap replicate $X^b = (X^b_1, X^b_2, \ldots, X^b_n)$ (the random selection with replacement) of
the training set $X = (X_1, X_2, \ldots, X_n)$, one can sometimes avoid or get fewer
misleading training objects in the bootstrap training set. Consequently, a classifier
constructed on such a training set may have a better performance. Aggregating actually
means combining classifiers. Often a combined classifier gives better results than
individual classifiers. Therefore, bagging might be helpful to build better classifiers on
training sample sets with misleaders. In bagging, bootstrapping and aggregating
techniques are implemented in the following way.

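1. Repeat for $b = 1, 2, \ldots, B$.

a) Take a bootstrap replicate $X^b$ of the training data set $X$.

b) Construct a classifier $C^b(x)$ (with a decision boundary $C^b(x) = 0$) on $X^b$.

c) Compute combining weights $c_b = \frac{1}{2}\log\left(\frac{1 - err_b}{err_b}\right)$, where $err_b = \frac{1}{n}\sum_{i=1}^{n} \xi_i^b$ and $\xi_i^b = 0$ if $X_i$ is classified correctly, $\xi_i^b = 1$ otherwise.

2. Combine classifiers $C^b(x)$, $b = 1, 2, \ldots, B$, by the weighted majority vote with weights $c_b$ to a final decision rule $\beta(x) = \arg\max_{y \in \{-1, 1\}} \sum_b c_b \, \delta_{\mathrm{sgn}(C^b(x)),\, y}$, where $\delta_{i,j} = 1$ if $i = j$ and $\delta_{i,j} = 0$ if $i \neq j$ is the Kronecker symbol, and $y \in \{-1, 1\}$ is a decision (class label) of the classifier.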
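
The two steps above can be sketched in code as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the base learner is a compact pseudo Fisher discriminant built on a pseudo-inverse, the error of each classifier is measured on the original training set, the error is clipped away from 0 and 1 so that the weight stays finite, and bootstrap replicates that happen to contain a single class are redrawn. All function names are hypothetical.

```python
import numpy as np

def fit_pfld(X, y):
    # Compact pseudo Fisher discriminant (hypothetical helper for this sketch).
    m1, m2 = X[y == 1].mean(axis=0), X[y == -1].mean(axis=0)
    Xc = np.vstack([X[y == 1] - m1, X[y == -1] - m2])
    w = np.linalg.pinv(Xc.T @ Xc / len(X)) @ (m1 - m2)
    return w, -0.5 * (m1 + m2) @ w

def combining_weight(err, eps=1e-10):
    # c_b = 0.5 * log((1 - err_b) / err_b); clipping is our assumption.
    err = np.clip(err, eps, 1 - eps)
    return 0.5 * np.log((1 - err) / err)

def bagging_fit(X, y, B=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    ensemble = []
    while len(ensemble) < B:
        idx = rng.integers(0, n, size=n)        # step 1a: bootstrap replicate
        if len(np.unique(y[idx])) < 2:          # redraw one-class replicates
            continue
        w, w0 = fit_pfld(X[idx], y[idx])        # step 1b: base classifier
        pred = np.where(X @ w + w0 >= 0, 1, -1)
        err = np.mean(pred != y)                # step 1c: error on the set X
        ensemble.append((w, w0, combining_weight(err)))
    return ensemble

def vote(ensemble, X):
    # Step 2: for labels in {-1, +1} the weighted majority vote reduces to
    # the sign of the weighted sum of the individual decisions.
    score = sum(c * np.where(X @ w + w0 >= 0, 1, -1) for w, w0, c in ensemble)
    return np.where(score >= 0, 1, -1)
```

A classifier with a training error of exactly 0.5 receives zero weight, and a classifier that is worse than chance receives a negative weight, so its vote is effectively inverted.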

Bootstrapping is most efficient when the training set size is of the order of the data
dimensionality. Hence, bagging is useful for linear classifiers constructed on critical
training sample sizes, when they are unstable [12]. When bootstrapping the training set
in bagging, on average only $1 - 1/e \approx 63.2\%$ of the training objects are used in each
bootstrap replicate. In this way, the bootstrap sample is comparable with a smaller
training set. So, the bagged classifier will have similar characteristics to the classifier
built on the smaller training set. The generalization error of the bagged classifier is
shifted with reference to the generalization error of the original classifier in the
direction of the generalization error obtained on a smaller training set (see Fig. 1).
Thus, the performance of bagging depends on the training sample size relative to the
data dimensionality and on the small sample size properties of the base classifier. It
implies that, when applied to LDA, bagging is useful for classifiers with a
non-decreasing learning curve (the dependency of the classification error on
the training sample size) constructed on critical training sample sizes [12].
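
The 63.2% figure can be checked numerically. The sketch below compares a Monte Carlo estimate of the fraction of distinct training objects in a bootstrap replicate with the exact expectation $1 - (1 - 1/n)^n$, which tends to $1 - 1/e$ as $n$ grows. The function names and the number of trials are illustrative choices.

```python
import numpy as np

def expected_unique_fraction(n):
    # Each object is missed by all n draws with probability (1 - 1/n)^n.
    return 1.0 - (1.0 - 1.0 / n) ** n

def simulated_unique_fraction(n, trials=2000, seed=0):
    rng = np.random.default_rng(seed)
    fractions = [
        len(np.unique(rng.integers(0, n, size=n))) / n
        for _ in range(trials)
    ]
    return float(np.mean(fractions))
```

For the training set of 80 objects used later in the paper, both values are close to 0.63, so a bootstrap replicate carries roughly as many distinct objects as a training set of about 50 objects.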

The Random Subspace Method is the combining technique proposed by Ho [6]. In
the RSM, one also samples the training data. However, the sampling is performed in
the feature space. Let each training object $X_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ ($i = 1, \ldots, n$) in the
training sample set $X = (X_1, X_2, \ldots, X_n)$ be a $p$-dimensional vector, described by $p$
features. In the RSM, one randomly selects $p^* < p$ features from the $p$-dimensional data
set $X$. In this way, one obtains a $p^*$-dimensional random subspace of the original
$p$-dimensional feature space. So the modified training set $X^b = (X^b_1, X^b_2, \ldots, X^b_n)$
consists of $p^*$-dimensional training objects $X^b_i = (x^b_{i1}, x^b_{i2}, \ldots, x^b_{ip^*})$ ($i = 1, \ldots, n$), where
the $p^*$ components $x^b_{ij}$ ($j = 1, \ldots, p^*$) are randomly selected from the $p$ components $x_{ij}$
($j = 1, \ldots, p$) of the training vector $X_i$ (the selection is the same for each training vector).
Then one constructs classifiers in the random subspaces $X^b$ and aggregates them in
the final decision rule. Namely, the RSM is organized in the following way.
1. Repeat for $b = 1, 2, \ldots, B$.

a) Select the $p^*$-dimensional random subspace $X^b$ from the original $p$-dimensional feature space $X$.

b) Construct a classifier $C^b(x)$ (with a decision boundary $C^b(x) = 0$) in $X^b$.

c) Compute combining weights $c_b = \frac{1}{2}\log\left(\frac{1 - err_b}{err_b}\right)$, where $err_b = \frac{1}{n}\sum_{i=1}^{n} \xi_i^b$ and $\xi_i^b = 0$ if $X_i$ is classified correctly, $\xi_i^b = 1$ otherwise.

2. Combine classifiers $C^b(x)$, $b = 1, 2, \ldots, B$, by the weighted majority vote with weights $c_b$ to a final decision rule $\beta(x) = \arg\max_{y \in \{-1, 1\}} \sum_b c_b \, \delta_{\mathrm{sgn}(C^b(x)),\, y}$, where $\delta_{i,j} = 1$ if $i = j$ and $\delta_{i,j} = 0$ if $i \neq j$ is the Kronecker symbol, and $y \in \{-1, 1\}$ is a decision (class label) of the classifier.

The RSM may benefit both from using random subspaces for constructing the classifiers
and from aggregating the classifiers. When the number of training objects is
relatively small compared with the data dimensionality, by constructing classifiers
in random subspaces one may solve the small sample size problem, because the training
sample size relatively increases in random subspaces. When data have many redundant
features, one may obtain better classifiers in random subspaces than in the
original feature space. The combined decision of such classifiers may be superior to a
single classifier constructed on the original training set in the complete feature space.

The performance of the RSM is also affected by the training sample size and the
small sample size properties of the base classifier [7]. The subspace dimensionality is
smaller than that of the original feature space while the number of training objects remains
the same. In this way, the relative training sample size increases. Similar to bagging, in the
RSM, the final classifier will have a shifting effect of the generalization error with
respect to the generalization error of the original classifier (see Fig. 1). However, this
shift will be in the opposite direction: in the direction of the generalization error
obtained on larger training sample sizes. So the RSM is useful for classifiers having a
decreasing learning curve constructed on small and critical training sample sizes.

Thus, bagging and the RSM may be useful for linear classifiers constructed on
critical training sample sizes. However, bagging is beneficial for linear classifiers with
a non-decreasing learning curve, while the RSM is useful for linear classifiers having a
decreasing learning curve. The PFLD is weak when it is constructed on critical
training sample sizes, having a peak of the generalization error at the point n=p (see Fig.
1). Thus, the learning curve of the PFLD increases for training sample sizes n<p, and
decreases for n>p. Therefore, both bagging and the RSM may improve the
performance of the single PFLD constructed on critical training sample sizes.



3 Data 



In our experimental investigations we considered one artificial and five real data sets 
representing a two-class problem, which we have modified in order to get data sets 
having a high redundancy in the data feature space. 

The artificial data set used to obtain data with many redundant features is the 2-dimensional
correlated Gaussian data set constituted by two classes with equal
covariance matrices. Each class consists of 500 vectors. The mean of the first class is
zero for both features. The mean of the second class is equal to 3 for both features. The
common covariance matrix is a diagonal matrix with a variance of 40 for the second
feature and unit variance for the first feature. This data set is rotated by using a rotation
matrix $R$.

The real data sets, used to obtain highly redundant data sets, are taken from the 
UCI Repository [14]. They are the 34-dimensional ionosphere data set, the 8- 
dimensional pima-diabetes data set, the 60-dimensional sonar data set, the 30- 
dimensional wdbc data set and the 24-dimensional german data set. 

In order to make these data sets highly redundant in the feature space, to each of
them we added $r$ completely redundant noise features distributed normally $N(0, 10^{-4})$.
Thus, we have obtained $(p+r)$-dimensional data sets with many completely redundant
noise features (one way of representing redundancy in the data feature set). Here,
$p$ is the dimensionality of the original data set. In order to obtain data sets where the
useful information is spread over all $p+r$ features (the other way of representing
redundancy in the feature set), we have rotated the data enriched by noise features.
The rotation was performed in all $p+r$ dimensions by using a Hadamard matrix [15].
In this way, we have obtained data sets where the classification ability (discrimination
power) is spread over all features.
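
The two constructions can be illustrated with a short sketch. Several points are assumptions rather than details taken from the paper: the Hadamard matrix is built by the Sylvester construction, which only yields orders that are powers of two (so it does not cover the 80-dimensional case used in the experiments), the matrix is scaled by $1/\sqrt{p+r}$ to make the rotation orthonormal, and the noise variance is left as a parameter `noise_var`. All function names are hypothetical.

```python
import numpy as np

def sylvester_hadamard(k):
    """Hadamard matrix of order 2**k built by the Sylvester construction."""
    H = np.array([[1.0]])
    H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
    for _ in range(k):
        H = np.kron(H2, H)
    return H

def add_noise_features(X, r, noise_var, seed=0):
    """Append r zero-mean Gaussian noise features carrying no class information."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, np.sqrt(noise_var), size=(X.shape[0], r))
    return np.hstack([X, noise])

def rotate(X):
    """Spread the information over all features with a scaled Hadamard matrix."""
    d = X.shape[1]
    if d < 1 or d & (d - 1) != 0:
        raise ValueError("this sketch supports only power-of-two dimensions")
    R = sylvester_hadamard(d.bit_length() - 1) / np.sqrt(d)
    return X @ R
```

Since the scaled matrix is orthonormal, the rotation preserves distances between objects: the intrinsic dimensionality is unchanged, but every rotated feature becomes a mixture of all original and noise features.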

Training sets are chosen randomly from a total data set. The remaining data are 
used for testing. All experiments are repeated 50 times on independent training sets. So 
all figures show the averaged results over 50 repetitions. The standard deviations of the 
mean generalization errors for the single and combined PFLD’s are around 0.01 for 
each data set. 

In bagging and the RSM, ensembles consist of 100 classifiers combined by the 
weighted majority vote in the final decision. The size p* of random subspaces used in 
the RSM is 10. 

4 The Effect of the Redundancy in the Data Feature Space 

In order to study the effect of redundancy in the data feature set on the performance of 
bagging and the RSM, we have modified several data sets by artificially increasing 
their redundancy in the data feature set (see previous section). In order to investigate 
whether redundancy representation in the data feature space affects the performance of 
the combining techniques, we modify the original data sets in two different ways: 1)
just adding r completely redundant noise features to the original p-dimensional data
set (in this way, the discrimination power is condensed in the subspace described by the
original p-dimensional data set), 2) additionally rotating the data set in the whole
(p+r)-dimensional space after injecting r completely redundant noise features into the original
data set (in this way, the discrimination power is spread over all p+r features).

Fig. 2 shows learning curves of the single PFLD, bagging and the RSM for the 80-dimensional
data sets (p+r=80). Fig. 3 represents the generalization errors versus the
number of redundant noise features r when the training sample size is fixed to 40 
objects per class (80 training objects in total). Left plots in Fig. 2 and 3 show the case 
when the classification ability is concentrated in p features. Right plots represent the 
case when the discrimination power is spread over all p+r features. Figures show that 
both combining techniques are useful in highly redundant feature spaces. The RSM 
performs relatively better when the discrimination power is spread over many features 
than when it is condensed in few features (this coincides with results obtained by Ho 
for decision trees [8]). When all data features are informative, the increasing 
redundancy in the data feature set does not affect the performance of the RSM. This 
can be explained as follows. In order to construct good classifiers in random
subspaces, it is important that each subspace contains as much useful
information as possible. This can be achieved only when information is “uniformly” spread over
all features (it is especially important when small random subspaces are used in the
RSM). If useful information is condensed in few features, many random subspaces
will be “empty” of useful information, the classifiers constructed in them will be bad
and may worsen the combined decision.
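
How often a random subspace is “empty” can be quantified with a simple counting argument. Assuming that exactly p of the p+r features are informative and that p* features are drawn uniformly without replacement, the number of informative features in a subspace is hypergeometric, and the probability of drawing none is C(r, p*) / C(p+r, p*). The sketch below evaluates this; the function names are hypothetical.

```python
from math import comb

def prob_empty_subspace(p, r, p_star):
    """Probability that p_star features drawn without replacement from
    p informative and r noise features contain no informative feature."""
    if p_star > r:
        return 0.0
    return comb(r, p_star) / comb(p + r, p_star)

def expected_informative(p, r, p_star):
    """Expected number of informative features in one random subspace."""
    return p_star * p / (p + r)
```

For the 2-dimensional Gaussian data extended to 80 features (p=2, r=78) with p*=10, the probability of an empty subspace is 4830/6320, i.e. about 0.76, so roughly three out of four base classifiers see only noise. After rotation every feature carries some information and this effect disappears.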

In contrast to the RSM, the performance of bagging is affected neither by the 
redundancy representation in the data feature set, nor by the increasing feature 
redundancy (it is affected by the data dimensionality referred to the training sample 
size). It happens because all features are kept in bagging when training objects are 
sampled. Rotation does not change the informativity of the bootstrap replicate of the
training set. Thus, bagging may perform better than the RSM for highly redundant
feature spaces when the discrimination power is condensed in few features and the
training sample size is small. However, the RSM outperforms bagging when the
discrimination power is distributed over all data features.

Fig. 2. Learning curves of the single PFLD, bagging and the RSM (p*=10) for the 80-dimensional
data sets (p+r=80). Left plots show the case when discrimination power is
condensed in p features (data have r completely redundant noise features). Right plots show the
case when discrimination power is spread over all features (after adding r redundant features to
the p-dimensional original data set, the data are rotated in all p+r=80 directions).

Besides the redundancy in the data feature set, other factors (e.g., the class
overlap, the data distribution etc.) also affect the performance of the PFLD, bagging and the
RSM. Usually many factors act simultaneously, assisting and counteracting each other.
Sometimes, some factors may have a stronger influence than the redundancy in the
data feature set. For instance, the sonar data set represents time signals where the order
of the features jointly with their values is very important. When rotating this data set
enriched by redundant noise features, some important information is lost. For this
reason, the RSM performs worse on the rotated sonar data set than when no rotation is
performed (see Fig. 2g,h and Fig. 3g,h). Another example, where rotation does not
improve the performance of the RSM, is the modified wdbc data set (see Fig. 2i,j and
Fig. 3i,j). All features of the original data set are already strongly correlated and it has
the largest Mahalanobis distance (d=14.652) among all considered data sets. It seems
that rotation somewhat “worsens” the distribution of the useful information over the
features. So the RSM performs worse on the rotated data set than when no rotation is
performed.

5 Conclusions 

Besides the training sample size relative to the data dimensionality and the choice of
the base classifier, the efficiency of the combining techniques may depend on the level
of redundancy in the data feature set and on the way this redundancy is presented.

Fig. 3. The generalization errors of the single and the combined PFLD’s versus the number of
redundant noise features r added to the original p-dimensional data sets. Left plots show the
case when the discrimination power is condensed in p features (data have r completely redundant
noise features). Right plots represent the case when the discrimination power is spread over
p+r features (after adding r redundant features to the p-dimensional data set, the data are rotated in
all p+r directions). 80 training objects are used to train the single and the combined classifiers.

When applied to the Pseudo Fisher linear classifier, the RSM performs relatively 
better when the classification ability (discrimination power and also the redundancy) is 
spread over many features (i.e., for the data sets having many informative features) 
than when the classification ability is condensed in few features (i.e., for the data sets 
with many completely redundant noise features). When the discrimination power is 
spread over all features, the RSM is resistant to the increasing redundancy in the data 
feature set. 

Unlike the RSM, the performance of bagging, when applied to the PFLD, depends
neither on the redundancy representation nor on the level of redundancy in the data
feature set (it depends on the data dimensionality relative to the training sample size).
Therefore, bagging may perform better than the RSM for highly redundant feature
spaces where the discrimination power is condensed in few features. However, when
the discrimination power is spread over all features, the RSM outperforms bagging.

The success of the combining techniques depends on many different factors that act 
simultaneously and may assist and counteract each other. However, notwithstanding 
the difficulty to study the influence of each factor independently of other ones, it is 
very important to understand what factors affect the performance of the combining 
techniques and in what way. Obviously, more study should be done in this direction. 
Abstract. AdaBoost boosts the performance of a weak learner by training
a committee of weak learners which learn different features of the
training sample space with different emphasis and jointly perform classi- 
fication or regression of each new data sample by a weighted cumulative 
vote. We use RBF kernel classifiers to demonstrate that boosting a Strong 
Learner generally contributes to performance degradation, and identify 
three patterns of performance degradation due to three different strength 
levels of the underlying learner. We demonstrate that boosting produc- 
tivity increases, peaks and then falls as the strength of the underlying 
learner increases. We highlight patterns of behaviour in the distribution 
and argue that AdaBoost’s characteristic of forcing the strong learner to 
concentrate on the very hard samples or outliers with too much emphasis 
is the cause of performance degradation in Strong Learner boosting. How- 
ever, by boosting an underlying classifier of appropriately low strength, 
we are able to boost the performance of the committee to achieve or 
surpass the performance levels achievable by strengthening the individ- 
ual classifier with parameter or model selection in many instances. We 
conclude that, if the strength of the underlying learner approaches the 
identified strength levels, it is possible to avoid performance degradation 
and achieve high productivity in boosting by weakening the learner prior 
to boosting . . . 



1 Introduction 

Freund & Schapire’s algorithm AdaBoost and its variants have been applied 
extensively for boosting the performance of weak learners. However, boosting 
tends to fail or degrade performance in some instances. Quinlan first attracted 
attention to performance degradation in his early experiments of boosting C4.5. 
Freund & Schapire record further results on performance degradation of C4.5. 
In their recent work, Freund & Schapire [2] pose this as an open problem: can
we characterise or predict when boosting will fail in this manner? 

We are strongly motivated to research the effects of the strength of the learner 
on boosting performance, paying specific attention to instances of performance 
degradation. This paper is an exposition of the effects of the strength of the 
underlying learner on the performance of AdaBoost and a response to the open 
problem posed by Freund & Schapire.
The learnability of a problem by a classifier is dependent on the inherent
difficulty of the problem, the capacity of the classifier, and the size of the training 
dataset relative to the size of the problem. Given a learning problem posed by 
a training dataset and a test dataset of fixed size, we form a judgement on the 
strength of a learner of fixed capacity, based on how well it learns the original 
training data supported by how well it performs on the test data. Variation 
in learner strength is achieved by varying the capacity of the learner from one 
committee to another. To achieve a weaker classifier it is only necessary to vary 
a parameter or the kernel, so that the classifier gives a lower training accuracy 
and a lower test accuracy according to the given training and test datasets 
respectively. 

We watch the annealing process variations as the strength of the learner 
increases from one committee to another. We watch the effects of the strength 
of the learner on the distribution and highlight very insightful patterns in its 
behaviour. The results highlight the influence of the strength of the learner 
on the boosting process at different levels of classifier strength. They highlight 
three different patterns of performance degradation depending on three different 
high-strength levels of the learner. From our results we argue that AdaBoost’s 
characteristic of forcing an underlying strong learner to concentrate on a few 
hard-to-learn samples or outliers with too much emphasis is the probable cause 
of performance degradation in boosting. We observe that boosting a learner of 
moderate strength is optimally productive and often achieves performance levels 
that surpass the performance of the strengthened individual classifier. Hence, 
we conclude by proposing that a learner approaching the high strength levels 
identified on a given problem instance be weakened prior to boosting, to achieve 
optimal productivity and to lower the probability of performance degradation. 

2 Background and the Experiments 

AdaBoost: AdaBoost.M1 is used for boosting RBF classifiers on the two-class 
classification problems Monks1, Monks2, Monks3, Poisonous Mushroom, Credit 
Scoring and Tic-Tac-Toe Endgame from the UCI data repository. 

Training the Learner: A Radial Basis Function classifier is trained as the 
underlying learner with variations in its strength being achieved by changing 
the number K of kernels. K is an input parameter to the training algorithm and 
is allowed to take a value between 2 and N, the number of training samples. 
The first layer projects the input samples to the K dimensional feature space by 
using spherical Gaussian kernels. The K function centres are trained by K-means 
clustering of the input samples. The function widths are set to twice the mean 
distance between the function centres. The final layer optimises the input samples 
projected onto the feature space by linear optimisation of the sum-of-squares 
error, and the real valued output is thresholded to provide a binary classification. 
The training algorithm allows us to train an RBF classifier of desired complexity 
on the training dataset. Given K and T, the number of iterations, AdaBoost 
trains a committee of T RBF classifiers, each with approximately K 
kernels. A classifier is trained according to the distribution by sampling the 
training data according to the distribution with replacement. 

Given: Training dataset: $(x_1, y_1), \ldots, (x_n, y_n)$ 
       Test dataset: $(x_1, y_1), \ldots, (x_m, y_m)$ 
       where $x_i \in X$, $y_i \in Y = \{-1, +1\}$ 

Initialise: $D_1(i) = \frac{1}{n}$ 

For $t = 1, \ldots, T$ 

  • Train weak learner with fixed capacity according to distribution $D_t$ 
  • If ($t = 1$) 
      $\epsilon_{train}$ := error on Training dataset 
      $\epsilon_{test}$ := error on Test dataset 
      Classifier Strength := $f(1 - \epsilon_{train}, 1 - \epsilon_{test})$ 
  • Get weak hypothesis $h_t : X \to \{-1, +1\}$ 
    with error $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$ 
  • Update 
    $$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t}, & \text{if } h_t(x_i) = y_i \\ e^{\alpha_t}, & \text{if } h_t(x_i) \neq y_i \end{cases}$$ 
    where $Z_t$ is the normalisation factor chosen so that $D_{t+1}$ is a distribution 

Output Classifier Strength and the final hypothesis 
$$H(x) = \mathrm{sign}\Big(\sum_{t=1}^{T} \alpha_t h_t(x)\Big)$$ 

Fig. 1. AdaBoost.M1 (Freund & Schapire) with a measure of the underlying classifier 
strength. 
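The procedure of Fig. 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the weak learner is a hypothetical one-dimensional threshold stump rather than an RBF classifier, the stump uses the weights directly instead of resampling the training data, the voting parameter is assumed to follow the standard choice $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$, and the unspecified strength function $f$ is replaced by simply returning the pair (training accuracy, test accuracy) of the first classifier.

```python
import math

def train_stump(xs, ys, weights):
    """Weighted threshold stump on 1-D inputs (hypothetical stand-in for the RBF learner)."""
    best = None
    for thr in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if (sign if x >= thr else -sign) != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    _, thr, sign = best
    return lambda x: sign if x >= thr else -sign

def accuracy(h, xs, ys):
    return sum(h(x) == y for x, y in zip(xs, ys)) / len(xs)

def adaboost(train, test, T):
    xs, ys = train
    n = len(xs)
    D = [1.0 / n] * n                       # D_1(i) = 1/n
    committee, strength = [], None
    for t in range(T):
        h = train_stump(xs, ys, D)
        if t == 0:
            # Strength of the base learner: (training accuracy, test accuracy).
            strength = (accuracy(h, xs, ys), accuracy(h, test[0], test[1]))
        eps = sum(w for x, y, w in zip(xs, ys, D) if h(x) != y)
        if eps == 0:                        # perfect hypothesis: keep it and stop
            committee.append((1.0, h))
            break
        if eps >= 0.5:                      # no better than chance: stop
            break
        alpha = 0.5 * math.log((1 - eps) / eps)   # assumed standard choice
        committee.append((alpha, h))
        # Lower the weight of correctly classified samples, raise the others.
        D = [w * math.exp(-alpha if h(x) == y else alpha)
             for x, y, w in zip(xs, ys, D)]
        Z = sum(D)                          # normalisation factor Z_t
        D = [w / Z for w in D]

    def H(x):
        return 1 if sum(a * h(x) for a, h in committee) >= 0 else -1
    return strength, H, D

strength, H, D = adaboost(([1, 2, 3, 4, 5, 6], [-1, -1, 1, -1, 1, 1]),
                          ([0, 7], [-1, 1]), T=3)
```

The returned distribution `D` is the one AdaBoost would use in the next round, so inspecting it after different numbers of iterations gives a rough view of which samples are being emphasised.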

It is noted that the classifier strength increases with K. The graphs of Fig. 2 
plot the performance of the learners against K . 



The Strength of the Learner: The strength of the classifier is defined with re- 
spect to the problem instance given by a fixed dataset, not the unknown problem. 
Subsequent performance improvements are judged relative to the fixed dataset. 
The complexity of the classifiers for a particular AdaBoost run is also fixed, so 
that all the classifiers of a particular committee have the same capacity. 

Within this context, an intuitive measure of the strength of the classifier 
being boosted is formed, based primarily on its performance on the training 
samples and supported by its performance on the test samples. The training 
accuracy is an important measure of the strength of the classifier in boosting as 
the committee concentrates on reducing its training error in a greedy manner 
by AdaBoost’s choice of the voting parameter $\alpha_t$. j0| However, we find that 
the generalisation accuracy must also support its strength. A classifier showing 
95% accuracy on the given training dataset and 95% accuracy on the given test 
dataset is clearly a strong learner of the problem instance. The RBF classifier 
with 25 function centres on the Tic-Tac-Toe dataset having approximately 90% 
training accuracy and 95% test accuracy is a significantly stronger learner than 
the RBF classifier with 30 function centres on the Monks1 dataset also having 
approximately 90% training accuracy, but only 82% test accuracy. However, a 
very high training accuracy (e.g. 94%) may be supported by a moderate test 
accuracy (e.g. 75%) and still show strong learner behaviour. Alternatively, when 
the test accuracy is very high (e.g. 97% for Tic-Tac-Toe) strong learner behaviour 
can start at classifiers with a lower training accuracy. 

Fig. 2. RBF performance graphs for Monks1, Monks2, Credit, Monks3, Tic-Tac-Toe 
and Mushroom datasets. Performance on the training data (plotted in red dots) and on 
the test data (plotted in blue squares). The RBF classifier starts as a weak learner and 
becomes strong as the complexity increases for Monks1, Monks2, and Credit Scoring 
datasets. It is a strong learner of Monks3, Tic-Tac-Toe and Mushroom datasets. 
(Suboptimal classifiers are due to the inverted matrix in linear optimisation being near 
degenerate.) 



3 Boosting a Strong Learner Is Generally 
Counterproductive or Unproductive 

As the strength of the classifier increases beyond approximately 90% (training 
accuracy above 90% and supported by high test performance), boosting consis- 
tently becomes counterproductive or unproductive. For the particular datasets 
the onset of unproductive or counterproductive boosting behaviour occurs when 
the base classifier is stronger than (approximately) the classifiers of Table 1. For 
all RBF classifiers stronger than these on the particular datasets, unproductive 
or counterproductive boosting behaviour is consistent. 



Table 1. The classifier strength at which consistent unproductive or counterproductive 
behaviour starts for datasets tested 

Dataset/Problem Title   Complexity          Training          Test              Behaviour 
                                            Performance (%)   Performance (%) 
Monks1                  47                  92.5              85.0              Unproductive 
Monks2                  75                  92.5              72                Unproductive 
Credit Scoring          45                  93.9              81.0              Unproductive 
Monks3                  18                  93.0              93.0              Counterproductive 
Tic-Tac-Toe Endgame     20                  89.0              95.0              Counterproductive 
Poisonous Mushroom      8 (lowest tested)   96.0              98.0              Counterproductive 


It is noted that, in the case of unproductive behaviour, the high training 
accuracy is supported by a relatively modest test performance. Careful observation 
of the graphed distribution [3] indicates that a relatively small number 
of samples gets strongly highlighted and that their emphasis in the distribution 
keeps growing at the expense of the emphasis on the other samples throughout 
the boosting process, but never comes down. The low generalisation performance 
of the base classifier seems to contribute to AdaBoost’s being unable to train a 
classifier in subsequent iterations that is capable of learning the highlighted very 
hard samples. This results in AdaBoost’s being unable to affect either training 
or test error. 

Fig. 3. Patterns of Performance Degradation I: Unproductive Boosting 

When a high training accuracy is supported by a high generalisation performance, 
AdaBoost is able to train hypotheses that learn the highly emphasised 
hard samples. Then the training error drops to zero, but the test error continues 
to increase throughout. The distribution repeatedly highlights a very small 
number of samples very strongly; even though their weighting comes down with 
a good hypothesis that learns them, they are soon highlighted strongly again 
and again. P] Hence too much emphasis is placed on the hard-to-learn outliers 
or the few very hard samples in the training data throughout, causing AdaBoost 
to continuously overfit the training data. 



[Plot: Mushroom Boosted Errors for average K = 24] 

Fig. 4. Patterns of Performance Degradation II: Continuously Counterproductive 
Boosting 



Onoda, Ratsch and Muller ^ conduct a theoretical analysis that confirms the 
conclusion drawn from our experimental results. They note: “when the annealing 
parameter |b| takes a very big value, AdaBoost type learning become a hard 
competition case: only the patterns with smallest margins (hardest samples) 
will get high weights; other patterns are effectively neglected in the learning 
process”. This analysis confirms the conclusion drawn from our observations and 
leads us to conclude that too much emphasis placed on a few hard samples in 
the training dataset, and the neglect of important features of the other samples, 
is the probable cause of performance degradation in strong learner boosting. 
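The "hard competition" effect described in the quotation can be illustrated numerically. The sketch below assumes, purely for illustration, that sample weights are proportional to $\exp(-|b| \cdot \text{margin}_i)$, where $|b|$ is the annealing parameter; this functional form, the function name `annealed_weights` and the margin values are assumptions made here and are not taken from the experiments in this paper.

```python
import math

def annealed_weights(margins, b_abs):
    """Normalised weights proportional to exp(-b_abs * margin) (assumed form)."""
    m_min = min(margins)
    # Shift by the smallest margin so the largest exponent is 0 (numerical stability).
    raw = [math.exp(-b_abs * (m - m_min)) for m in margins]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical margins of five training samples; the first is the hardest.
margins = [0.05, 0.10, 0.40, 0.60, 0.90]
for b_abs in (0, 1, 10, 100):
    weights = annealed_weights(margins, b_abs)
    print(b_abs, [round(w, 3) for w in weights])
```

With $|b| = 0$ the weights are uniform; as $|b|$ grows, almost all of the weight moves onto the sample with the smallest margin, which is the concentration on a few hard samples discussed above.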

Counterproductive boosting results recorded by Freund and Schapire P 
on boosting C4.5 on Soybean-small, House-votes-84, Votes1 and Hypothyroid 
datasets all have very high test performances; the training accuracy is unknown. 
However, Freund and Schapire record a number of productive instances of boost- 
ing C4.5 where the generalisation performance is high; the training accuracy is 
unknown. The (unknown) training accuracy is a strong factor in our measure of 
learner strength. However, this warrants further investigation. 



[Plots: TicTacToe Boosted Errors for average K = 16; Monks3 Boosted Errors for average K = 14] 

Fig. 5. Patterns of Performance Degradation III - Initially high degradation as the 
training error drops to zero, followed by a correction that is unable to compensate for 
the initial overfitting 



4 Behaviour of the Distribution 

The distribution over the training samples was graphed. Careful observation of it 
gives an intuitive insight into the error reduction process. 

When the classifier is very weak it highlights a large number of samples as 
hard and the weighting is distributed over a large number of samples. The dis- 
tinction between weightings is small. There is a lot of activity in the distribution 
as the weightings change by small amounts over a large set of samples from iter- 
ation to iteration. Thus each iteration poses a large region in the feature space 
with little distinction among sample hardness, to be learned by a relatively weak 
classifier. This contributes to the slow error reduction process. 

When the classifier is a moderate performer fewer samples are highlighted, 
allowing a classifier of moderate strength to focus on a moderate size region in 
the feature space. When they are learned, their weighting comes down allowing 
another moderate set to be highlighted. Their features are quickly learned to 
achieve very fast and optimal error reduction. 

As the classifier gets even less weak, a smaller set of somewhat harder samples 
is highlighted. It takes a number of attempts to learn their features during which 
they remain highly emphasized with little activity. As soon as these samples are 
learned (noted by a strong performing hypothesis) their weighting comes down 
suddenly allowing a different small set to be highlighted; this activity generally 
coincides with a “step reduction” of the errors. 

When the classifier is somewhat stronger, a few “significantly hard” samples 
are very strongly highlighted initially. In its attempt to learn the features of these 
few samples particular to the training data, AdaBoost seems forced to ignore the 
less hard but important low-highlighted samples (the weighting on the few very 
hard samples keeps increasing at the cost of the others) and overfits the training 
dataset. Once they are learned, a correction process immediately follows, where 
AdaBoost relearns the important features of the other hard samples with the 
right emphasis (their weighting is allowed to rise), thereby reducing the gener- 
alization error somewhat. However, the subsequent reduction is often unable to 
compensate for the initial overfitting. This contributes to a third form of performance 
degradation, when the classifier strength is just below the strength levels 
of those discussed in Section 3. 

The behaviour of the distribution when the underlying classifier is very strong 
is analysed in section 3. 



5 Boosting Performance Increases, Peaks, and then Falls 
as the Strength of the Weak Learner Increases 

Boosting performance graphs are taken for committees with decreasing levels 
of learner weakness for Monks1, Monks2 and Credit datasets. The decrease in 
weakness from one committee to the next is achieved by increasing the number 
of kernels K in the underlying classifier. 

When the classifier is significantly weak the boosting process is slow and 
ridden with fluctuations. When the weakness of the classifier decreases to a 
certain level, AdaBoost achieves very fast and maximum error reduction. 

AdaBoost is also able to boost the performance of a somewhat strong learner. 
A clear distinction of such a boosting instance is the “step reduction process” 
where the errors remain stable for a number of iterations and suddenly reduce 
by a significant amount. 

AdaBoost’s ability to boost the generalisation performance lessens as the 
RBF classifiers get stronger beyond a particular level. When the classifier’s 
strength increases further, it overfits the training data and increases the general- 
ization error by a significant amount as the training error rapidly drops to zero. 
After the training error has dropped to zero the generalization error resumes 
the reduction process. However, in many instances the subsequent reduction is 
unable to compensate for the initial overfitting. This shows a third pattern of per- 
formance degradation. The performance degradation graph recorded by Quinlan 
0 in boosting C4.5 on the Colic dataset also demonstrates this behaviour. The 
test performance of the base C4.5 classifier is reported as 85.08%; the training 
accuracy is unknown. 

[Plots: Monks1 Boosted Errors for average K = 10, 25, 34, 43, 47 and 53] 

Fig. 6. Monks1 boosting for K=10,25,34,43,47,53. Error reduction is slow for very 
weak learner boosting. Boosting achieves very fast and optimal error reduction as the 
classifier weakness decreases. Initial overfitting effects and “step reduction” in errors 
shown as the weakness decreases further. Boosting becomes unproductive as the 
classifier becomes a Strong Learner. Detailed graphs in full paper [3] 



The annealing behaviour when the strength of the classifier is very high is 
analysed in detail in section 3. 

AdaBoost is hence optimally productive when the underlying classifier is 
moderately weak. It is slow and has low productivity when the learner is signif- 
icantly weak. It is generally unproductive or counterproductive in boosting an 
already strong learner. If it is not possible to weaken a strong classifier to an 
appropriate level, it is better to train a good individual classifier by optimising 
on parameter and model selection. 

The AdaBoost annealing process is graphed for Monks1 for RBF classifiers 
with increasing strength in Fig. 6. The annealing process follows a very similar 
pattern for datasets Monks2 and Credit Scoring. More detailed graphs for 
Monks1, Monks2 and Credit scoring are provided in the full paper [3]. 

6 Boosting an Appropriately Weakened Classifier Can 
Improve on the Performance of the Same Classifier 
Strengthened by Parameter or Model Selection 



The strong classifier is weakened prior to boosting. In our definition of learner 
strength, and our experiments, the capacity of all the classifiers in a particular 
committee is fixed. Hence the weakening is achieved by varying a free parame- 
ter or the kernel (thereby changing this capacity), so that the learner shows a 
higher training error and a higher test error with respect to the same dataset. 
Weakening in our experiments is achieved by decreasing the number of kernels. 
The boosting performance of the weakened classifier is graphed together with 
the peak performance achievable by the classifier individually (straight line in 
graph). It is clear from the graphs of Fig. El that the boosted performance is 
able to improve on the performance of the classifier optimised with respect to 
K for weak learners of Monks1. (Further graphs in [3].) Boosting the weaker 
learner similarly improves on the peak performance for Monks2 and achieves 
performance close to the peak performance for Credit Scoring. A vast amount of 
published literature, in particular |I| and Eli report further results and analy- 
ses of significant performance improvements achieved by the boosted committee 
over the individual classifier when the learner is not strong. 

The weakened learner causes AdaBoost to concentrate on the samples mod- 
elling the more representative regions in the feature space, and reduces the ex- 
cessive emphasis on the few boundary samples. Hence, the probability of perfor- 
mance degradation is decreased, and the committee is often able to learn more 
regions in the feature space than an individual learner. 



7 Conclusion 

We have demonstrated that boosting an already strong learner is generally coun- 
terproductive, and that the boosting performance increases, peaks and then falls 
as the strength of the underlying weak learner increases. We have identified 
three patterns of performance degradation depending on three different identi- 
fied strength levels of the underlying learner: 

1. Unproductive boosting behaviour, when the base learner has high training 
accuracy supported by a modest test accuracy. 

2. Continuously counterproductive boosting behaviour when the base learner 
has high training accuracy supported by a high test accuracy. 

3. Initially highly counterproductive boosting behaviour followed by a slightly 
productive phase that is unable to compensate for the initial overfitting, 
when the base learner is just approaching the strength levels of instances 
discussed in 1 and 2. 

We have highlighted patterns of behaviour in the distribution and have ar- 
gued that performance degradation in boosting is due to: 

1. An underlying strong learner 

2. AdaBoost’s characteristic of forcing the learner to concentrate on the very 
hard samples and outliers with too much emphasis when the learner is strong. 

However, boosting a learner of appropriately low strength achieves good 
error reduction and, in some instances, improves on the peak performance the 
strengthened learner is capable of achieving by parameter or model selection. 
We have proposed therefore that, when the underlying learner approaches the 
Strong Learner strength levels identified, it is possible to lower the probability 
of performance degradation and to achieve higher boosting productivity by 
weakening the learner. It is possible to weaken a kernel machine that is a Strong 
Learner by using a weaker kernel, decreasing the number of kernels or varying a 
free parameter, so that the learner has a higher training error and a higher test 
error with respect to the given dataset. 

The proposed solution works by eliminating the first identified cause of per- 
formance degradation. Future work would address improving AdaBoost so that 
it curbs the excessive emphasis it places on the few extremely hard samples and 
outliers when the underlying learner is strong. 

Abstract. A communication model for the Hypothesis Boosting (HB) problem 
is proposed. Under this model, the AdaBoost algorithm can be viewed as a 
threshold decoding approach for a repetition code. Generalization of such 
decoding view under the theory of Recursive Error Correcting Codes 
allows the formulation of a generalized class of low-complexity learning 
algorithms applicable in high dimensional classification problems. In this paper, 
an instance of this approach suitable for High Dimensional Feature Spaces 
(HDFS) is presented. 



1 Introduction 

Established Machine Learning boosting [1] theory assumes a low dimensional feature 
space setting. The extension of boosting to arbitrary HDFS is an area of potential 
interest [2] in fields like Information Retrieval. In this paper, we address the extension 
of the HB concept to HDFS by recalling common sense teaching-learning strategies 
and their similarity to the design of RECCs. 

The remainder of the paper is organized as follows. Section 2 introduces a 
communication model for the HB problem and the interpretation of AdaBoost 
algorithm as an instance of APP threshold decoding. Section 3 introduces a 
generalized recursive learning approach in order to cope with complexity when 
constructing boosting algorithms in high dimensional spaces. Section 4 presents a first 
stage implementation suitable for high input domains through the Turbo_Learn 
algorithm. Finally, in Section 5 a summary and future work are presented. 



2 Teaching and Learning Strategies 

How must we teach and how can we learn? Both questions are essential in the design 
of ML algorithms. Consider a teaching-through-examples process for a target concept 
c belonging to a class C : X -> {-1,1} . Similarly, let WL be a weak learning-from-examples 
algorithm and let S be the training sample. Trivial repetition of the target 
concept may be considered as a simple teaching strategy in order to improve the WL 
performance. 
Such a strategy can be implemented by exposing S as many times as WL requires, 
reinforcing the presence of harder examples each time. In Machine Learning theory, 
the above teaching strategy is no more than the hypothesis-boosting concept. In the 
next section, we will show that the HB problem can be viewed as the transmission of 
concepts through a noisy channel. Thus, under suitable (concept) channel-encoding, 
arbitrarily small (learning) error rates can be achieved. 



3 A Communication Model for HB 

Transmission of information through a noisy channel requires channel-encoding [3] 
schemes. Let us consider the transmission of concepts c belonging to a target class 
$C : X \to \{0,1\}$ imbedded in some metric space $R^p$. Assume that transmission is 
intended with accuracy $\epsilon$ so that C can be described by a set $A_\epsilon$ with $N_\epsilon(C)$ 
elements (the set $A_\epsilon$ being a minimal $\epsilon$-net for C under covering numbers theory [4] 
[5]). Following [6], for each $c \in C$ we can define a deterministic mapping $E : C \to B$, 
so that each concept can be represented by a bitstream $b \in B$ with length $m_b$ 

$$m_b(C) = \log_2 N_\epsilon(C) \tag{1}$$ 

In order to transmit $c \in C$, E simply selects the integer $j \in \{1, \ldots, N_\epsilon(C)\}$ for 
which the ball $Ball(a_j, \epsilon)$ with center in concept $a_j$ and radius $\epsilon$ contains the target 
concept c. Similarly, we can define a decoding mapping $D : B \to R$ so that a received 
bitstream $b \in B$ is mapped into a concept $a_j$, j being the integer with bit 
representation sequence $b \in B$. In communication terms, the mapping $E : C \to B$ can 
be modeled as a Discrete Memoryless Source (DMS) with output alphabet X, 
$|X| = N_\epsilon(C)$. Let q be the DMS output distribution and let $H(X)$ be the entropy 
characterizing such DMS 

$$P_U(a_k) = q_k, \qquad k = 1, \ldots, N_\epsilon(C), \qquad a_k \in X \tag{2}$$ 

Let us consider the transmission of information symbols from such source through 
a Discrete Memoryless Channel (DMC) [7] characterized by a finite capacity and 
resembling a weak learner behavior. Shannon’s Noisy Coding theorem [8] states that 
reliable transmission of information through a noisy channel can be achieved by 
suitable channel encoding. Coding proceeds by transmission of arbitrarily long T 
source sequences at an information symbol rate $r = k/T$, k being the number of 
information symbols in each T-sequence. If almost random encoding is performed 
at the transmitter side, then as $T \to \infty$, the block error probability can be bounded 

as follows 

P « (3) 

Whenever r is less than channel capacity $C_n$, arbitrarily small (learning) error rates 
can be achieved by the suitable introduction of parity redundant concepts. It should be 
noted that for the learning case, r values are limited to $1/T$ ($k = 1$) if learning proceeds 
in a concept by concept fashion. For this case, the unique allowable linear block-coding 
scheme is a T-repetition code. Thus, in order to cope with learner limitations, 
a teacher would repeat the target concept T times, resembling the transmission of a 
codeword t 



$$t = (\underbrace{c(x), c(x), \ldots, c(x)}_{T\text{-times}}) \tag{4}$$ 

Under the assumption of a weak learning algorithm with errors resembling a DMC 
channel, learning becomes a decoding problem on a received sequence r 

$$r = (h_1(x), h_2(x), \ldots, h_T(x)) \tag{5}$$ 

For binary transmitted and received concepts 

$$r = t + e \pmod 2 \tag{6}$$ 

The decoding problem is to give a good estimate $\hat{e}$ for the error vector e under prior 
knowledge on channel behavior by means of probabilities $p_i = P(e_i = 1)$, $1 \le i \le T$, so 
that a final estimation $t^* = r + \hat{e}$ can be assembled. Therefore, a suitable learning 
algorithm should in some aspects mirror the behavior of decoding schemes. From 
learning theory, we know that adaptation is a desirable feature for good generalization 
abilities and in fact, the same requirement applies for decoding algorithms when 
dealing with very noisy channels. In decoding terms, adaptation is equivalent to the 
application of APP (A Posteriori Probability) decoding techniques. In the next section, we 
will analyze APP decoding methods for T-repetition codes. For the sake of brevity, we 
refer the reader to Massey's original doctoral dissertation [9] for background on APP 
methods. 



3.1 Threshold Decoding for T - Repetition Codes 



Let us consider a simple T-repetition code and the threshold-decoding estimation of 
the unique information bit. A T-repetition code naturally induces the following 
trivial set of parity check equations 

$$A_t = e_1(x) - e_t(x), \qquad 2 \le t \le T \tag{7}$$ 

The above set is orthogonal on bit $e_1$ (in the APP sense for linear block codes). Thus, 
we can estimate $e_1$ as follows 

$$\hat{e}_1 = \begin{cases} 1 & \text{if } \sum_{i=2}^{T} A_i \log\frac{1-p_i}{p_i} > \frac{1}{2} \sum_{i=1}^{T} \log\frac{1-p_i}{p_i} \\ 0 & \text{otherwise} \end{cases} \tag{8}$$ 

The receiver will perform the following $\hat{t}$ estimates depending on the received value 
$r_1$. It can be shown that 

$$\hat{t} = \begin{cases} 0 & \text{if } \sum_{i=1}^{T} (2r_i - 1) \log\frac{1-p_i}{p_i} \le 0 \\ 1 & \text{otherwise} \end{cases} \tag{9}$$ 

Let us introduce the linear mapping $x \mapsto 2x - 1$ between binary alphabets $A = \{0, 1\}$ 
and $A^+ = \{-1, 1\}$. Then equation (9) can be expressed as 

$$\hat{t} = \begin{cases} -1 & \text{if } \sum_{i=1}^{T} r_i \log\frac{1-p_i}{p_i} \le 0 \\ 1 & \text{otherwise} \end{cases} \tag{10}$$ 
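The threshold rule for a T-repetition code can be sketched as follows. This is a minimal illustration, assuming each received copy $r_i$ is flipped independently with a known probability $p_i < 1/2$; the function names and the example values are hypothetical.

```python
import math

def app_decode_repetition(received, error_probs):
    """Estimate the single information bit from T noisy copies.

    received: bits r_i in {0, 1}; error_probs: flip probabilities p_i in (0, 0.5).
    """
    score = sum((2 * r - 1) * math.log((1 - p) / p)
                for r, p in zip(received, error_probs))
    return 1 if score > 0 else 0

def app_decode_bipolar(received_pm, error_probs):
    """Same rule after mapping the alphabet {0, 1} to {-1, +1} via x -> 2x - 1."""
    score = sum(r * math.log((1 - p) / p)
                for r, p in zip(received_pm, error_probs))
    return 1 if score > 0 else -1

# One reliable copy (p = 0.05) outvotes two unreliable copies (p = 0.4).
bit = app_decode_repetition([1, 0, 0], [0.05, 0.4, 0.4])
```

With equal error probabilities the rule reduces to a plain majority vote; the weights only matter when some copies are more reliable than others.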



3.2 The Repetition of Concepts and APP- Threshold Decoding 



Let us assume a teaching by repetition strategy over T units of time on a fixed instance 
x through an additive DMC channel resembling a weak learner performance. 
Therefore, the learner can now implement APP decoding in its threshold-decoding 
form in order to arrive at a final decision. Assuming transmitted and received 
concepts with output domain {-1,1}, we get 

$$c^*(x) = \begin{cases} -1 & \text{if } \sum_{i=1}^{T} h_i(x) \log\frac{1-p_i}{p_i} \le 0 \\ 1 & \text{otherwise} \end{cases} \tag{11}$$ 

where each $p_i$ is the probability that a received concept $h_i(x)$ is different from the 
transmitted concept $c_i(x)$, $1 \le i \le T$, i.e. the error probability achieved by the i-th 
WL. Denoting $w_i = \log\frac{1-p_i}{p_i}$, for fixed x equation (11) is almost the AdaBoost 
decision. However, two differences are observable. First, there is a factor $\frac{1}{2}$ 
difference between APP weighting factors $w_i$ and those derived from AdaBoost. 
Though this fact does not affect the final decision, its presence can be explained [10] 
by the exponential cost function used in AdaBoost instead of a Log-likelihood 
criterion. The other difference is that computation of APP weighting factors $w_i$ 
requires exact channel error probabilities. However, recall that in HB we always 
know the target concept at a finite set of sample points S so that we can provide a 
sample mean estimate $\hat{w}_i$ associated to each weak hypothesis $h_i(x)$ under 
distribution $D_i$ for S as follows 

$$\hat{w}_i = \frac{1}{2} \log \frac{E_{S \sim D_i}[c(x) = h_i(x)]}{E_{S \sim D_i}[c(x) \neq h_i(x)]} \tag{12}$$ 

Thus if we use (12) to estimate the weighting factors $w_i$ required by the threshold 
decoding rule, the target learner will issue a final decision $h_f(x)$ 

$$h_f(x) = \mathrm{sign}\Big(\sum_{i=1}^{T} \hat{w}_i h_i(x)\Big) \tag{13}$$ 

which is exactly the AdaBoost decision for discrete weak hypotheses with output 
domain Y = { -1 , 1 } . Concept repetition is a special case of general block concept-channel 
coding schemes. For AdaBoost-like boosting algorithms, there is no way to 
construct an unbounded set of orthogonal parity check equations for increasing T 
values. At some point dependency between distributions leads to significant correlation 
between errors so that no further improvements can be achieved. At this point, the 
best we can do is to adjust threshold coefficients, i.e. “the size of the weights is more 
important than the size of the network” [11] [12]. 
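The sample-based weight estimate and the resulting weighted vote can be sketched as below. This is an illustration under assumptions: the factor 1/2 follows the AdaBoost convention, the function names are hypothetical, and the sketch assumes every weak hypothesis makes at least one weighted error on S, since the ratio is otherwise undefined.

```python
import math

def estimate_weight(labels, predictions, dist):
    """w_hat = 1/2 * log(P_D[c = h] / P_D[c != h]) for one weak hypothesis."""
    agree = sum(d for y, p, d in zip(labels, predictions, dist) if y == p)
    disagree = sum(d for y, p, d in zip(labels, predictions, dist) if y != p)
    return 0.5 * math.log(agree / disagree)

def final_decision(weights, votes):
    """Sign of the weighted sum of the weak hypotheses' votes in {-1, +1}."""
    total = sum(w * v for w, v in zip(weights, votes))
    return 1 if total >= 0 else -1

# Target concept and one weak hypothesis on four sample points, uniform D.
labels = [1, 1, -1, -1]
predictions = [1, 1, -1, 1]
w_hat = estimate_weight(labels, predictions, [0.25] * 4)

# Three weak hypotheses vote on a new instance; the reliable one prevails.
decision = final_decision([0.9, 0.3, 0.3], [1, -1, -1])
```

Multiplying every weight by the same positive constant leaves `final_decision` unchanged, which is why the factor 1/2 does not affect the final decision.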



4 Learning by Diversity: Recursive Models 

The decoding view for the HB problem explains simply the classic teaching by 
concept repetition strategy. In addition, it also suggests many unexplored teaching 
schemes. When learning classes which are too complex, it would be useful to think of 
some kind of target concept expansion so that any concept can be expressed and 
reconstructed from a fixed number of simple base concepts, i.e. a learning by diversity 
model. Let $c(x)$ be a target concept admitting some kind of expansion 

$$c(x) = Span(c_1(x), \ldots, c_k(x)) \tag{14}$$ 

Then, a teaching strategy for a weak learner may be viewed as the transmission of 
a frame of base concepts over a noisy channel. Each concept codeword in the frame 
must be decoded first in order to reconstruct the whole target concept. For each Span 
definition, a particular learning algorithm would be obtained. A good example can be 
found in the ECOC [13] approach for M-class problems, where an M-valued target 
function is broken down into $k \ge \log_2 M$ binary base functions through an Error 
Correcting Output enCoding scheme (ECOC). Thus, the selected encoding scheme 
implicitly defines the components in the Span expansion whilst the Minimum 
Hamming Distance criterion defines the Span recombination function. An essential 
limitation in ECOC behavior is the increasing coding length requirement for better 
generalization performance and of course for growing M values. In fact, this is a well-known 
problem in coding theory, where the trade-off between block-coding length 
and error rates has been extensively treated. Coding theory has been able to find a 
promising solution for such problem under the theory of Recursive Error Correcting 
Codes so that alternative low-complexity ECOC extensions could be derived from 
them. 
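The ECOC idea of splitting an M-valued target into binary base functions and recombining them by minimum Hamming distance can be sketched as follows. The codebook below is a hypothetical example for M = 4 classes with k = 5 bits, chosen here only for illustration; each bit position would correspond to one binary base learner.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions in which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical codebook: one 5-bit codeword per class.
CODEBOOK = {
    0: (0, 0, 0, 0, 0),
    1: (0, 1, 1, 0, 1),
    2: (1, 0, 1, 1, 0),
    3: (1, 1, 0, 1, 1),
}

def min_distance(codebook):
    return min(hamming(a, b) for a, b in combinations(codebook.values(), 2))

def decode(bits):
    """Recombination step: pick the class whose codeword is closest to the outputs."""
    return min(CODEBOOK, key=lambda c: hamming(CODEBOOK[c], bits))

# Outputs of the five binary learners for an instance of class 2,
# with the first learner making a mistake.
predicted = decode((0, 0, 1, 1, 0))
```

Because the minimum distance of this codebook is 3, any single erroneous base learner is corrected; tolerating more errors requires longer codewords, which is the coding length limitation mentioned above.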
Definition 1: A bipartite graph is one in which the nodes can be partitioned in two 
different classes. An edge may connect nodes of distinct classes but there are no edges 
connecting nodes of the same class. 

Definition 2: A Recursive Error Correcting Code (RECC) is a code constructed from 
a number of similar and simple component subcodes. A RECC in its simple form can 
be described by bipartite graphs known as Tanner graphs [14]. 

Let us consider a simple example of a RECC constructed from two parity 

check subcodes S ^ and S 2 ■ Codewords in this simple RECC are all binary 6-tuples, 
which simultaneously verified parity restrictions imposed by each component 
subcode. 
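
As a small illustration, the codewords of such a code can be enumerated by brute force. The bit positions checked by each subcode are an assumption made for this sketch (the actual connections are those drawn in Fig. 1), and the names `S1`, `S2` and `recc_codewords` are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple

# Assumed supports: the bit positions checked by each parity subcode.
S1: Tuple[int, ...] = (0, 1, 2, 3)
S2: Tuple[int, ...] = (2, 3, 4, 5)


def satisfies(word: Sequence[int], support: Sequence[int]) -> bool:
    """Even-parity check restricted to the bits in `support`."""
    return sum(word[i] for i in support) % 2 == 0


def recc_codewords(n: int = 6,
                   subcodes: Sequence[Sequence[int]] = (S1, S2)) -> List[Tuple[int, ...]]:
    """All binary n-tuples that satisfy every component parity check."""
    return [w for w in product((0, 1), repeat=n)
            if all(satisfies(w, s) for s in subcodes)]
```

With two independent parity checks on 6 bits, the enumeration keeps one quarter of the 64 possible tuples.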




Fig. 1. Tanner graphs for a simple RECC built from two component subcodes 

The main objective of defining codes in terms of component subcodes is to reduce
decoding complexity. A RECC can be decoded by an ensemble of decoding
processes, each one at a component subcode (check nodes in Tanner graph terms),
followed by an exchange of information between them on the bits (local variables in
Tanner graph terms) they have in common. It should be noted that this decoding
approach requires the implementation of local a posteriori probability (APP) decoding
methods, because these are the only methods that give us probabilistic estimates of
code bits. For purposes of learning in high output domains, the set of code bits would
define a set of binary weak learners with their corresponding error rates, with
communicating socket points at the component subcodes. The essential fact about
Tanner graph representations is that they imply the existence of a message-passing
mechanism between check nodes and local variables [15], and this is precisely what we
need for the design of low-complexity learning algorithms in high dimensional spaces.

Now, let us consider the learning problem for target classes defined for HDFS. A
common-sense strategy would be to choose a reduced and informative set of
features and teach through an associated attribute-filtered version of S. The problem
with this strategy is that it requires prior knowledge. It may happen that we do not
have such prior knowledge, or even that there is no reduced set of informative
attributes. In such cases, an alternative strategy can still be applied. We can expose
different, perhaps random, attribute-filtered versions of S to a set of weak learners and
then let them exchange information in order to improve their common learning
performance. It happens that this ML strategy can also be modeled by Tanner graphs
under the theory of RECCs.

4.1 Boosting Algorithms in HDFS

Let C : X → K be a target class; the problem is to reduce the learning complexity
caused by the number q of features in X. In the absence of prior knowledge about the
relevant features, we may take the sample S and perform d Random Feature Filtering
(RFF) steps over the set of feature vectors available in the training set S. We have in
mind a low-density (with respect to q) binary random attribute filter matrix H
characterized by the presence of k ones per row and j ones per column, similar to the
parity-check matrix of a Low Density Parity Check (LDPC) code [16][17]. The whole
filtering process implemented by H over X can be modeled using a Tanner graph, as
shown in Fig. 2.




Fig. 2. RFF with q = 6, d = 4, k = 3, j = 2

The RFF process creates a set of input spaces $X_r$, $|X_r| = k \ll q$,
$1 \leq r \leq d$. Therefore, from a sample S we can obtain a vector of samples $S_H$ with
components being RFF versions of S. Because of the underlying random and sparse
structure of the diversity matrix H, the sample components in $S_H$ may be assumed
to be independent. Then, we can apply a set of d supervised learning processes so that
each weak learning algorithm $L_r$ over a sample $S_r$ issues a weak hypothesis
$h_r(x)$ ($r = 1, \ldots, d$). As each learner sees only a fraction of the feature space X, its
decisions suffer from some kind of distortion due to the filtering process.
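
A minimal sketch of the filtering step is given below, assuming a simple shuffle-and-retry construction of H. It is not the construction method of [19]; the function names and the `seed` and `max_tries` parameters are hypothetical.

```python
import random
from typing import List, Sequence


def random_filter_matrix(q: int, d: int, k: int, j: int,
                         seed: int = 0, max_tries: int = 1000) -> List[List[int]]:
    """Random d x q binary matrix with k ones per row and j ones per column."""
    if d * k != q * j:
        raise ValueError("need d * k == q * j for a regular matrix")
    rng = random.Random(seed)
    # Every feature index appears j times; rows are consecutive chunks of size k.
    slots = [col for col in range(q) for _ in range(j)]
    for _ in range(max_tries):
        rng.shuffle(slots)
        rows = [slots[r * k:(r + 1) * k] for r in range(d)]
        # Accept only if no row selects the same feature twice.
        if all(len(set(row)) == k for row in rows):
            return [[1 if c in row else 0 for c in range(q)] for row in rows]
    raise RuntimeError("no valid matrix found; increase max_tries")


def filter_sample(X: Sequence[Sequence[float]],
                  H: Sequence[Sequence[int]]) -> List[List[List[float]]]:
    """Return the d filtered versions of the feature vectors in X, one per row of H."""
    return [[[x[c] for c, bit in enumerate(h_row) if bit] for x in X]
            for h_row in H]
```

With the parameters of Fig. 2 (q = 6, d = 4, k = 3, j = 2), each of the 4 weak learners would see 3 of the 6 features and every feature would be seen by exactly 2 learners.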



4.2 Recursive Classifiers in HDFS 

Assume that we have a teacher, a target concept and two different weak learners.
Differences between learners arise because of their distinct criteria about the most
important features defining a target concept. The same target concept is taught to each
weak learner by repeating the concept d times. After the teacher has completed his
class, each learner will be asked about the target concept. Both learners are allowed to
exchange information before issuing a final decision. The first learner will issue a first
decision after d units of time and will then help the second learner to
improve its decision. The second learner will repeat the process and will help the first
after d units of time, and so on. From the theory of RECCs, the proposed architecture is
a simplified learning version of a turbo coding scheme [18]. In Fig. 3 the proposed
learning model is shown by a Tanner graph representation.




Fig. 3. Turbo_Learn exchange of information by means of a Tanner graph 



Propagation of messages begins at the first graph from left to right until reaching
the check node and then continues to the second graph structure. This structure can be
generalized using T parallel boosting units, thus defining the Turbo_Learn algorithm.



Turbo_Learn Algorithm

Input: LDPC matrix H, sample S with |S| = m, weak learner WL, d, T

Initialization: $D_1^1(i) = 1/m$, $1 \leq i \leq m$

For each $1 \leq t \leq T$, for each $1 \leq r \leq d$, do
   $h_r^t(x) = WL(S_r, D_r^t)$, choose $\alpha_r^t$
   $D_{r+1}^t(i) = D_r^t(i) \exp(-\alpha_r^t y_i h_r^t(x_i)) / Z_r^t$, $1 \leq i \leq m$
   $D_p^t(i) = D_{r+1}^t(i)$, being $p = (r + 1) \bmod (d + 1)$, $1 \leq i \leq m$
End

Output: $h_f(x) = \mathrm{sign}\left( \sum_{t=1}^{T} \sum_{r=1}^{d} \alpha_r^t h_r^t(x) \right)$

Theorem 1: The training error $\varepsilon$ in Turbo_Learn is at most $\prod_{t=1}^{T} \prod_{r=1}^{d} Z_r^t$.

Proof: The proof is almost the same as that in AdaBoost [1], by unraveling the
expression of $D_r^t(i)$ after T boosting steps.



To conclude, in Fig. 4 we present a representative Turbo_Learn test error response
on the Vote dataset (UCI Repository of ML). We used the methods in [19] to generate
the H matrices and a Decision Stump algorithm as base learner.

Fig. 4. Vote domain (binary problem, 16 attributes). Turbo_Learn test error for diversity d,
T outer boosting steps and a Decision Stump weak learner: TL(d, T)+DS

The obtained response shows that boosting is achievable, but that it depends on the
balance between diversity d and attribute density k/q. The latter parameter clearly
regulates algorithm performance because of its intrinsic effect on the independence
assumption over samples. Furthermore, only a small number of outer boosting steps are
required in order to improve overall learning.



5 Conclusions and Future Work 

The main contribution of this work is the introduction of recursive-coding-related
models for the analysis and design of practical boosting algorithms in high
dimensional spaces. A number of directions for further work and research stand out.
It is necessary to extend our toy examples to high dimensional data and to analyze
how random filtering parameters affect convergence properties. An important
research line is the development of recursive decoding models for high-M
classification problems as a generalization of the ECOC approach. It should be noted
that alternative binary boosting schemes could be constructed if we replace
T-repetition block codes by convolutional schemes with rates 1/T, i.e. the target concept
and T-1 different parity concepts. It would be interesting to analyze the feasibility of
this approach.
Abstract. This paper investigates a methodology for effective model
selection of cost-sensitive boosting algorithms. In many real situations,
e.g. for automated medical diagnosis, it is crucial to tune the classification
performance towards the sensitivity and specificity required by
the user. To this purpose, for binary classification problems, we have
designed a cost-sensitive variant of AdaBoost where (1) the model error
function is weighted with separate costs for errors (false negatives and
false positives) in the two classes, and (2) the weights are updated
differently for negatives and positives at each boosting step. Finally, (3) a
practical search procedure makes it possible to reach, or get as close as
possible to, the sensitivity and specificity constraints without an extensive
tabulation of the ROC curve. This off-the-shelf methodology was applied to the
automatic diagnosis of melanoma on a set of 152 skin lesions described by
geometric and colorimetric features, outperforming, on the same data
set, skilled dermatologists and a specialized automatic system based on
a multiple classifier combination.



1 Introduction 

In constructing a predictive classification tool for a real-world application, e.g. an
automated diagnosis system, it is now well recognized that misclassification costs
have to be incorporated into the learning process HCg. Still, it is much less clear
how to drive the system towards the optimal performance in terms of sensitivity
and specificity, which are the measures typically required in practice.
Given a good cost-sensitive algorithm, a particular choice of the costs will lead
to a model characterized by a pair of sensitivity and specificity values,
i.e. a point in the ROC space. But very often the costs (e.g. of a false negative
or of a false positive in a binary medical classification problem) are estimated
only approximately, or at least less definitely than the minimum acceptable
sensitivity and specificity rates; thus one is left with the doubt that modifying
the costs might prove more effective than improving the model in order to reach the
minimum specified performance. Most likely, the learning procedure will also
depend on the prior probabilities of the classes, so adding training material
at fixed costs will produce a model with different sensitivity and specificity. The
main question we want to address in this paper is thus how to develop a good
cost-sensitive classification algorithm which is as independent as possible
of cost availability and class imbalance, and which does not require dense
sampling of the ROC curve for each training data set in order to satisfy, or
get as close as possible to, the performance constraints (in terms of sensitivity and
specificity).

As a cost-sensitive algorithm, we will discuss in this paper a variant of the
AdaBoost algorithm jHj: basic AdaBoost allows us to build systems with high
accuracy (low misclassification error) and, although misclassification costs were not
originally considered for the training phase, we can build a good cost-sensitive
variant which separately optimizes the margins for the two classes. In our
variant, cost-sensitive boosting is achieved by (A) weighting the model error function
with separate costs for false negative and false positive errors, and (B) updating
the weights differently for negatives and positives at each boosting step.

Similar approaches have been described elsewhere. In particular, a
cost-sensitive variant of AdaBoost was adopted for AdaCost 0: based on the
assumption that a misclassification cost factor has to be assigned to each training
example, the weights are increased in case of misclassification or decreased
otherwise according to a non-negative function of the costs. A different model error
function than in (A) is considered there, as we focus on explicit weighting in terms
of sensitivity and specificity. Karakoulas and Shawe-Taylor El have also
introduced a similar approach based on misclassification costs that are constant for
all the samples in a class. Their procedure increases the weights of false negatives
more than those of false positives and, differently from our approach, decreases the
weights of true positives more than those of true negatives.

We have also included in our methodology a practical search procedure to
reach, or get as close as possible to, a target region in the sensitivity-specificity
space. The aim is to wrap the whole cost-sensitive boosting learning cycle in
a model selection procedure. For example, in the melanoma diagnosis problem,
we will simulate the development of a model with sensitivity greater than 0.95
and specificity greater than 0.50 (a sensible request for assisting the screening
of skin lesions). The search procedure allows us to satisfy for the first time the
sensitivity and specificity constraints, without a tabulation of the whole curve
in the ROC space. Our variant of AdaBoost allowed a remarkable improvement
over previous results on the same data set obtained with a combination of
classifiers specifically designed for the task |2|. An improvement was also found in the
control of variability (standard deviation of error estimates). In the melanoma
diagnosis application, our combined strategy proved more effective than
applying an external cost criterion to AdaBoost (as documented similarly in PI).

The paper is organized as follows. Section 2 briefly introduces the
classification problem which inspired our approach. Our cost-sensitive variant
of AdaBoost, including the search procedure, is described in Section 3. The
approach is evaluated on the melanoma data in Section 4. Section 5 concludes
the paper.

2 The MEDS Melanoma Data

As described in |3|, the MEDS data base is composed of 152 digital
epiluminescence microscopy images (D-ELM) of skin lesions, acquired at the Department of
Dermatology of Santa Chiara Hospital, Trento. Image processing of D-ELM data
produces 5 geometric-morphologic and 33 colorimetric features for each image,
for a total of 38 features. D-ELM images and similar features were also used for
automated classification in 0. According to a subsequent histological analysis,
the MEDS data base includes 42 malignant lesions (melanomas: positive examples)
and 110 naevi (negative examples). In 0, different classifiers and a panel of 8
dermatologists were compared, reproducing a tele-dermatology set-up. For
comparison of results, we here use the same experimental structure as the previous
study, consisting of a 10-fold cross-validation for the estimates of sensitivity and
specificity of the classification systems and of the physicians.

3 The SSTBoost Cost-Sensitive Procedure 

3.1 AdaBoost and SSTBoost 

Let us start with the basic AdaBoost 0. Given a training data set $L = \{(x_i, y_i)\}$,
with $i = 1, \ldots, N$, where the $x_i$ are input vectors (numerical, categorical or mixed)
and the $y_i$ are class labels taking values -1 or 1, the discrete AdaBoost
classification model is defined as the sign of an incremental linear combination of different
realizations of a base classifier, each one trained on a weighted version of L,
increasing the weights for the samples currently misclassified. Alternatively, if the
base model does not accept weights internally, it can be trained over weighted
bootstrap versions of L. The AdaBoost procedure for a combination H of T base
classifiers is summarized in Box 1.

- Given $L = \{(x_i, y_i)\}_{i=1,\ldots,N} \subset X \times \{-1, +1\}$
- Initialize $D_1(i) = 1/N$
- For $t = 1, \ldots, T$:
  1. Train the base classifier h using distribution $D_t$.
  2. Get hypothesis $h_t : X \to \{-1, +1\}$
  3. Compute model error $\epsilon_t = \sum_{i} D_t(i)\,\theta[y_i h_t(x_i) = -1]$,
     where $\theta[P]$ returns 1 if predicate P is true, 0 otherwise.
  4. Choose $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$
  5. Update $D_{t+1}(i) = \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$, where $Z_t$ is a
     normalization factor chosen so that $D_{t+1}$ will be a distribution
- Output the final hypothesis: $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$

Box 1: The AdaBoost algorithm
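
The discrete AdaBoost loop can be sketched in a few lines of plain Python. The base classifier here is a decision stump chosen by exhaustive search, which is an assumption made to keep the example short (the paper uses maximal decision trees); all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Stump = Tuple[int, float, int]  # (feature index, threshold, polarity)


def stump_predict(stump: Stump, x: Sequence[float]) -> int:
    f, thr, pol = stump
    return pol if x[f] <= thr else -pol


def train_stump(X, y, D) -> Tuple[Stump, float]:
    """Exhaustive search for the stump with the smallest weighted error."""
    best, best_err = (0, 0.0, 1), float("inf")
    for f in range(len(X[0])):
        for thr in sorted({x[f] for x in X}):
            for pol in (1, -1):
                err = sum(w for x, yi, w in zip(X, y, D)
                          if stump_predict((f, thr, pol), x) != yi)
                if err < best_err:
                    best, best_err = (f, thr, pol), err
    return best, best_err


def adaboost(X, y, T: int) -> List[Tuple[float, Stump]]:
    n = len(X)
    D = [1.0 / n] * n                          # D_1(i) = 1/N
    ensemble = []
    for _ in range(T):
        stump, eps = train_stump(X, y, D)      # weighted model error eps_t
        eps = min(max(eps, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - eps) / eps)
        D = [w * math.exp(-alpha * yi * stump_predict(stump, x))
             for x, yi, w in zip(X, y, D)]
        Z = sum(D)                             # normalization factor Z_t
        D = [w / Z for w in D]
        ensemble.append((alpha, stump))
    return ensemble


def predict(ensemble, x) -> int:
    score = sum(a * stump_predict(s, x) for a, s in ensemble)
    return 1 if score >= 0 else -1
```

On a one-dimensional toy problem whose positive class lies on both sides of the negative class, no single stump is correct everywhere, while three boosting rounds already separate the training data.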
- Given $L = \{(x_i, y_i)\}_{i=1,\ldots,N} \subset X \times \{-1, +1\}$
- Given cost parameter $w \in [0, 2]$
- Define $c_i = w$ if $y_i = +1$, and $c_i = 2 - w$ if $y_i = -1$
- Initialize $D_1(i) = 1/N$
- For $t = 1, \ldots, T$:
  1. Train base classifier h using distribution $D_t$.
  2. Get hypothesis $h_t : X \to \{-1, +1\}$
  3. Compute model error $\epsilon_t = (1 - Se)\,\pi_{+1} w + (1 - Sp)\,\pi_{-1}(2 - w)$
  4. Choose $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$
  5. Update $D_{t+1}(i) = \frac{D_t(i)}{Z_t} \exp(-\alpha_t (2 - c_i))$ if $y_i h_t(x_i) = +1$,
     and $D_{t+1}(i) = \frac{D_t(i)}{Z_t} \exp(\alpha_t c_i)$ if $y_i h_t(x_i) = -1$,
     where $Z_t$ is a normalization factor chosen so that $D_{t+1}$ will be a
     distribution
- Output the final hypothesis: $H_w(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$

Box 2: The SSTBoost algorithm: internal learning procedures



To illustrate a typical situation, maximal decision trees implemented
following the classic reference ^ can be considered as base classifiers $h_t$. In many
applications, using unpruned trees not only avoids introducing an additional
metaparameter (the regularization one) into the system, but maximal trees also
give overall optimal or suboptimal results with boosting when there is enough
interaction between variables, as discussed in mnD.

The definition of the model error $\epsilon$ in AdaBoost (Box 1) does not differentiate
the costs of misclassification for training data from different classes. In Box 2
we introduce a variant of AdaBoost (Sensitivity-Specificity Tuning Boosting:
SSTBoost) which takes costs into account at two different levels. Firstly, given
class priors $\pi_i$ and costs (or losses) $c_i$ of a misclassification for class
$i \in \{-1, +1\}$, we propose to consider the following weighted version of the model error:

$\epsilon = (1 - Se)\,\pi_{+1} c_{+1} + (1 - Sp)\,\pi_{-1} c_{-1}$. (1)

As discussed in P, rather than considering separately the values of the two
costs $c_{-1}$ and $c_{+1}$, it is more convenient to consider the cost ratio or to impose
a constraint $c_{+1} + c_{-1} = \mathrm{const}$. In a more complete view, imbalance between
classes may also play an important role, not necessarily correlated with the
cost ratio: in these cases one should consider the extended cost ratio
$\frac{c_{+1}\pi_{+1}}{c_{-1}\pi_{-1}}$.
We do not consider this extension in this paper. In Box 2, we introduce the
cost parameter $w \in [0, 2]$ such that $c_{+1} = w$ and $c_{-1} = 2 - w$: clearly, w = 1
corresponds to the classical AdaBoost model, while values of w > 1 will increase
the contribution to the error of misclassified positive cases, and vice versa for
w < 1. In particular, suppose w > 1: a greater weight $\alpha_t$ will therefore be
assigned to the models with higher sensitivity.
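
The effect of w on the weighted model error can be checked numerically with a short sketch. For simplicity it uses unweighted estimates of Se, Sp and of the class priors computed from a list of predictions, which is an assumption of this example; the function names are hypothetical.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int],
                            y_pred: Sequence[int]) -> Tuple[float, float]:
    """Se = TP / (TP + FN) and Sp = TN / (TN + FP) for labels in {-1, +1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def weighted_error(se: float, sp: float, prior_pos: float, w: float) -> float:
    """Cost-weighted error with c_{+1} = w and c_{-1} = 2 - w."""
    return (1 - se) * prior_pos * w + (1 - sp) * (1 - prior_pos) * (2 - w)
```

For w = 1 the expression reduces to the plain misclassification rate, while for w > 1 missed positives contribute more to the error than false alarms.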



[Figure: curves labeled $y_i h_t(x_i) = +1$ and $y_i h_t(x_i) = -1$; axis label $D_t(i)$]

Fig. 1. Weight updates.



On a more local scale, Step 5 in Box 2 introduces a second variation to AdaBoost
in the weight updating procedure. For w > 1, the weights of the misclassified
positive samples will be increased more than those of misclassified negatives, and
the weights of the correctly classified negative samples will be decreased more
than those of the correctly classified positives (Figure 1). In order to induce higher
sensitivity, the procedure therefore pays more attention to the hardest positive
examples. In terms of margin analysis, the result of the procedure is to increase
the margin of the positive samples more than that of the negative ones. According
to the results in m, it follows that the measure of confidence in the prediction
is higher for the positive samples, i.e. for w > 1 the final SSTBoost model has
been trained to generalize with higher sensitivity.
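
A weight update with these properties can be sketched as follows. The exponents used, $-\alpha_t(2 - c_i)$ for correctly classified samples and $\alpha_t c_i$ for misclassified ones, are an assumption chosen to be consistent with the qualitative description above; the function name `sst_update` is hypothetical.

```python
import math
from typing import List, Sequence


def sst_update(D: Sequence[float], y: Sequence[int], pred: Sequence[int],
               alpha: float, w: float) -> List[float]:
    """One cost-sensitive reweighting step for labels and predictions in {-1, +1}."""
    new = []
    for d_i, y_i, p_i in zip(D, y, pred):
        c_i = w if y_i == 1 else 2.0 - w        # per-class cost
        if y_i * p_i == 1:                      # correctly classified
            new.append(d_i * math.exp(-alpha * (2.0 - c_i)))
        else:                                   # misclassified
            new.append(d_i * math.exp(alpha * c_i))
    z = sum(new)                                # normalization factor
    return [v / z for v in new]
```

With w = 1 both cases reduce to the usual AdaBoost factors exp(-alpha) and exp(alpha); with w > 1 a misclassified positive gains more weight than a misclassified negative, and a correctly classified negative loses more weight than a correctly classified positive.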

This property was tested on the MEDS melanoma data base: in the left panel
of Figure 2 the cumulative margin distribution (data from both classes) is shown
for three different values of the misclassification costs. For w = 1 (equivalent to
the AdaBoost algorithm), we can see that the margins are approximately
concentrated between 0.5 and 0.8. For values of w different from 1, a gap in the
margin distribution is observed. In particular, for w = 1.34375 the cumulative
distribution remains flat approximately from 0.3 to 0.8 (solid curve in the left
panel of Figure 2). The right panel of Figure 2 clarifies how the gap originates
for this value of the cost parameter: training has aggressively increased the
margin of the positive samples (always greater than 0.8), while the margin of the
negative samples remains lower than 0.3.



Tuning Cost-Sensitive Boosting and Its Application to Melanoma Diagnosis 



[Figure: cumulative margin distributions; horizontal axis: Margin, from 0.0 to 1.0;
vertical axis from 0.0 to 1.0]

Fig. 2. Left panel: cumulative margin distribution for different values of the
misclassification costs. For values of the misclassification cost w different from 1, a gap in
the margin distribution is observable. Right panel: cumulative margin distribution of
positive and negative samples for w = 1.34375.



3.2 The SSTBoost Search Procedure 

The cost-sensitive procedure discussed in Section 3.1 supports the development
of classification models differently polarized towards sensitivity or specificity.
We now describe a procedure for the automatic selection of an optimal cost
parameter w* in order to satisfy, or get as close as possible to, admissible
sensitivity and specificity constraints. The idea is to take advantage of the
cost-sensitive learning variant described in Box 2 and at the same time to avoid a
manual tuning of w or an extensive tabulation of the possible $H_w$ in order to
reach the minimal operative requirements. If A is a target region in the ROC
space, i.e. A is a compact subset of [0, 1] x [0, 1], the constraints are satisfied for
w* such that $(1 - Sp(w^*), Se(w^*)) \in A$, where Se(w) and Sp(w) are predictive
estimates (e.g. from cross-validation over the training data L) of the sensitivity
and specificity of the model $H_w$ computed according to Box 2. The goal is then
the minimization of the distance between the ROC curve and the compact set
A, where the ROC curve is defined as $\phi_H : [0, 2] \to [0, 1] \times [0, 1]$,

$\phi_H(w) = (1 - Sp(w), Se(w))$. (2)

The problem can be addressed as a minimization problem of a function of one
real variable. Let $\Delta : [0, 2] \to \mathbb{R}^+$ be defined as

$\Delta(w) = \mathrm{dist}(\phi_H(w), A) = \min_{a \in A} \|\phi_H(w) - a\|$. (3)

- Given a target region A
- Set $w_1 = 1$, $w_{min} = 0$, $w_{max} = 2$
- Train $H_{w_1}(x)$ as in Box 2 and use cross-validation to estimate $\phi_H(w_1)$
- For $i = 1, \ldots, M$
  1. If $\phi_H(w_i) \in$ Failure Region
        Abort
     ElseIf $\phi_H(w_i) \in$ Low Specificity
        $w_{i+1} = 1/2\,(w_{max} - w_i)$
        $w_{min} = w_i$
     ElseIf $\phi_H(w_i) \in$ Low Sensitivity
        $w_{i+1} = 1/2\,(w_i - w_{min})$
        $w_{max} = w_i$
     ElseIf $\phi_H(w_i) \in A$
        Return $w^* = w_i$
     EndIf
  2. Train $H_{w_{i+1}}$ and use cross-validation to estimate $\phi_H(w_{i+1})$
  EndFor

Box 3: The SSTBoost tuning procedure



The problem admits a solution, not necessarily unique: the possible optimal cost
parameters are selected by $w^* = \arg\min_w \Delta(w)$. In practice, constraints are likely
to be of the type (Se > a AND Sp > b). In this case, A is a rectangular subset
and the two components of $\phi_H$ are increasing, so numerous search algorithms
can be applied to quickly identify an optimal cost parameter w*. A simple
but effective bisection method is described in Box 3. The algorithm fails when
$\phi_H([0, 2]) \cap A = \emptyset$; otherwise one has to choose one of the w such that
$\phi_H(w) \in A$ according to some super-optimality criterion, or just stop at the first
admissible w. It is clear that there are several effective alternatives for the search
procedure, all leading to a fast convergence towards A, or at least as near to
A as possible, particularly without strict hypotheses on the cost parameter.
However, it must be taken into account that $\phi_H(w)$ is only estimated (by
cross-validation, in our example), and thus its smoothness is not necessarily
assured. If information about the costs $c_i$ is available, as discussed
in P, the search can be constrained within a smaller interval $I \subset [0, 2]$. Finally,
a non-Euclidean distance may be considered for the dist function in Eq. 3.
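
A search of this kind can be sketched as follows for a rectangular target region. The sketch uses plain midpoint bisection and assumes that increasing w raises sensitivity and lowers specificity; it treats a point violating both constraints as the failure region. These choices, the `estimate` callback (standing in for training plus cross-validation) and all names are assumptions of this example rather than a literal transcription of Box 3.

```python
import math
from typing import Callable, Tuple


def dist_to_target(fpr: float, se: float, se_min: float, sp_min: float) -> float:
    """Euclidean distance from the ROC point (fpr, se) to the rectangle
    {se >= se_min, fpr <= 1 - sp_min}; zero inside the rectangle."""
    dx = max(0.0, fpr - (1.0 - sp_min))
    dy = max(0.0, se_min - se)
    return math.hypot(dx, dy)


def tune_cost(estimate: Callable[[float], Tuple[float, float]],
              se_min: float, sp_min: float,
              max_steps: int = 10) -> Tuple[float, float]:
    """Bisection on w in [0, 2]; returns the closest w found and its distance."""
    w_min, w_max, w = 0.0, 2.0, 1.0
    best_w, best_dist = w, float("inf")
    for _ in range(max_steps):
        se, sp = estimate(w)                   # e.g. cross-validated Se(w), Sp(w)
        dist = dist_to_target(1.0 - sp, se, se_min, sp_min)
        if dist < best_dist:
            best_w, best_dist = w, dist
        if dist == 0.0:                        # target region reached
            break
        if se < se_min and sp < sp_min:        # assumed failure region
            break
        if se < se_min:                        # low sensitivity: raise w
            w_min = w
        else:                                  # low specificity: lower w
            w_max = w
        w = 0.5 * (w_min + w_max)
    return best_w, best_dist
```

Each call to `estimate` corresponds to one full training and validation cycle, so the number of bisection steps is the dominant cost of the tuning.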



4 Application to the Melanoma Data 

We applied the procedure described in Boxes 2 and 3 to develop an effective
model for early melanoma diagnosis. The goal was the development of a tool for
supporting the discrimination between malignant and benign lesions in accordance
with application-specific constraints based on the MEDS data set described in
Section 2. As the system was designed to support early diagnosis in a screening
modality, it was required to recognize the maximum possible number of malignant
lesions, accepting a specificity of at least 0.5: in the real case, a non-expert
clinician will perform the first visit, and all the patients with a suspected melanoma
will be re-evaluated by a specialist. The target region A in the ROC space is
therefore defined as Se > 0.95 and 1 - Sp < 0.5. The target region corresponds to
the shaded rectangle in the left panel of Figure 3. The ROC curve (estimated by
cross-validation for different $H_w$ models) is also plotted: the curve is obtained
by tabulation of $\phi_H(w)$, following the SSTBoost procedure in Box 2. The 6 circles
indicate the performance for models $m_1, \ldots, m_6$: the models were obtained as
steps of the tuning procedure in Box 3. This experiment shows that we can avoid
computing a dense estimate of the ROC curve and let the algorithm self-tune
in order to reach the target region in 6 convergence steps.

Fig. 3. Left panel: ROC curve for the SST boosting model applied to the melanoma
MEDS data. The dashed rectangle represents the target region and the points on the
curve indicate the value of the cost w during the optimization phase as described in
Box 3. Right panel: distribution of the κ statistic for pairs of models $(m_i, m_{i+1})$ from
the tuning procedure.

Table 1. For each classifier and combination of classifiers, sensitivity and specificity,
with the standard deviation, are shown. The asterisk indicates results from P|.

Classifier                       Sens. ± SD     Spec. ± SD
Discr. Ana.*                     0.65 ± 0.30    0.83 ± 0.11
C4.5*                            0.64 ± 0.28    0.84 ± 0.05
1-NN*                            0.68 ± 0.30    0.90 ± 0.10
9-NN*                            0.41 ± 0.25    0.96 ± 0.04
Discr. Ana. + C4.5 + 1-NN*       0.86 ± 0.32    0.64 ± 0.11
Discr. Ana. + C4.5 + 9-NN*       0.84 ± 0.32    0.71 ± 0.12
AdaBoost                         0.49 ± 0.32    0.97 ± 0.04
th-AdaBoost                      0.92 ± 0.12    0.70 ± 0.14
SSTBoost                         0.97 ± 0.07    0.54 ± 0.18

An important issue in model selection procedures is to test the effective
difference between different proposed classifiers. Given two models and a common
test set, we can compute the κ statistic to test the difference between the two;
for details, see |^. For κ = 0 the agreement between classifiers equals that
expected by chance, while κ = 1 indicates complete agreement between the two
models. The distribution of the κ statistic at each step of the tuning algorithm, i.e.
for each pair $(m_i, m_{i+1})$, is shown in the right panel of Figure 3. For each pair of
models (i.e. of cost parameters), the κ statistic is computed on each of the
10 cross-validation test sets. It can be observed that diversity between models
progressively reduces at each step: at the end, the median κ value is greater than
0.8, indicating very small changes in the model performance.
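
The κ statistic for two classifiers evaluated on a common test set can be computed as in the sketch below, which uses the standard chance-corrected agreement formula; the function name and the list-based interface are assumptions of this example.

```python
from typing import Sequence


def kappa(pred_a: Sequence[int], pred_b: Sequence[int]) -> float:
    """Chance-corrected agreement between two label sequences of equal length."""
    a, b = list(pred_a), list(pred_b)
    n = len(a)
    labels = set(a) | set(b)
    # Observed agreement: fraction of test cases with identical predictions.
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance from the marginal label frequencies.
    p_exp = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if p_exp == 1.0:  # both classifiers always emit the same single label
        return 1.0
    return (p_obs - p_exp) / (1.0 - p_exp)
```

Applied to the predictions of two consecutive models of the tuning procedure on each cross-validation test set, values close to 1 indicate that the two models make nearly identical decisions.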

Table 1 summarizes performance results over the MEDS data set for
classifiers developed in a previous study 0, and the different variants of AdaBoost
studied in this paper, including SSTBoost. In ^j, different classifiers were
developed, and the most interesting results were obtained by combination. In
particular, the performance closest to the constraints was obtained for a combination
of Discriminant Analysis, C4.5 and Nearest Neighbors. That performance was
comparable with the average over a panel of 8 expert dermatologists, under the
same experimental conditions. The value for AdaBoost reported in Table 1 is
clearly unbalanced towards specificity: however, a family of models was obtained
from the AdaBoost models by thresholding the margin distributions for the two
output classes and then choosing an optimal model th-AdaBoost as a function
of the threshold. The performance for SSTBoost is also reported in Table 1 (see
also Figure 3), and it indicates the overall best results. Not only was the
target region A reached in only 6 steps, but the model variability was also very
moderate in comparison with the other models developed in this and in the
previous study. It is interesting to note that the improvement of SSTBoost over
th-AdaBoost seems to confirm the considerations on the differences between the
MetaCost architecture for boosting and AdaCost reported in M.

5 Conclusions 

This paper describes a methodology for automatically building a model with the 
required minimal performance in terms of sensitivity and specificity. 

The introduction of a cost parameter w both within the estimated error
function and within the weight updating of AdaBoost (as in (Z|) allows us
to effectively increase the margin of the predictions of one class with respect to 
the other. As a consequence, any admissible choice of this parameter leads to 
models characterized by different sensitivity-specificity pairs. We also indicate 
a procedure for selecting the optimal w, i.e. such that the corresponding model 
reaches or goes as close as possible to the target region defined in terms of 
required sensitivity and specificity. We have given only a basic example of a 
self-tuning procedure for cost-sensitive model selection: more elaborate search 
procedures may be considered within this approach. 
Abstract. This work proposes a novel method for constructing RBF
networks, based on boosting. The task assigned to the base learner is to
select an RBF, while the boosting algorithm linearly combines the different
RBFs. At each iteration of boosting a new neuron is incorporated into
the network.

The method for selecting each RBF is based on randomly selecting
several examples as the centers, considering the distances to these centers as
attributes of the examples and selecting the best split on one of these
attributes. This selection of the best split is done in the same way as
in the construction of decision trees. The RBF is computed from the
center (attribute) and threshold selected.

This work is not about using RBFNs as base learners for boosting, but 
about constructing RBFNs by boosting. 



1 Introduction 

This work is a follow-up of our research on boosting similarity literals for time
series classification CHI. In that work, following the good results of boosting
very simple classifiers (i.e., stumps) on several data sets jO|, we proposed to use
similarity literals as base classifiers. The format of these literals was:

[ not ] <distance>_le( Example, Reference, Attributes, Value )

which is true if the <distance> between the Example considered and another
Reference example, restricted to the Attributes considered, is less than or equal to
(le) Value.
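
A literal of this form can be sketched as a small predicate. The Euclidean distance and the function names are assumptions made for illustration; any other distance restricted to the selected attributes could be plugged in.

```python
import math
from typing import Callable, Sequence


def euclidean(a: Sequence[float], b: Sequence[float],
              attributes: Sequence[int]) -> float:
    """Euclidean distance restricted to the given attribute indices."""
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in attributes))


def similarity_literal(example: Sequence[float], reference: Sequence[float],
                       attributes: Sequence[int], value: float,
                       distance: Callable[..., float] = euclidean,
                       negated: bool = False) -> bool:
    """[ not ] <distance>_le(Example, Reference, Attributes, Value)."""
    result = distance(example, reference, attributes) <= value
    return (not result) if negated else result
```

For a multivariate time series, `attributes` would hold the indices of the values belonging to one series, so that each series can be compared independently.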

Normally, the parameter Attributes will include all the attributes of the
examples, and it would be unnecessary. The reason for its inclusion is that for
several types of machine learning problems it is natural to group the attributes of
the examples. An important example is the case of multivariate time series. For
these problems, there are several series (e.g. x, y), and each series is composed
of several values. In this case, it is interesting to be able to use the distance
between each series independently.

Each base classifier was only one literal, and its result was simply true or
false. One of the improvements to the original AdaBoost algorithm is the use
of confidence-based predictions, where each base classifier also returns, for
each example, a confidence (a real number) in its prediction.

* This work has been supported by the Spanish CYCIT project TAP 99-0344 and the
“Junta de Castilla y Leon” project VA101/01

A natural question is how to combine confidence-based predictions with
similarity literals. The first option was to consider a given literal as a boolean
attribute and, using the methods of [?] for domain-partitioning weak classifiers,
to assign a confidence value to each of the values true and false of the
literal. Nevertheless, when using distance literals it is natural to use the
distance from the current example to the reference example, together with the
threshold value, to obtain a confidence. In fact, an obvious option is a radial
basis function (RBF). On the other hand, the result of AdaBoost is a linear
combination of the predictions of the base classifiers, and if the base
classifiers are RBFs, then the result of boosting is an RBF network (RBFN) [?].

This work is also related to the methods for constructing RBFNs from
decision trees [?]. These works share the idea of constructing the network from
a symbolic machine learning method. For instance, the selection of each RBF
is based on the one for the split of a node in a decision tree. A difference from
those methods is that, in the case of decision trees, there are two steps: first,
constructing the symbolic classifier and second, “upgrading” it to a network.
In our case, although there are also two parts, the base learner and the boosting
method proper, they work in cooperation.

The rest of the paper is organized as follows. Section 2 describes the
operation of the base learner. The concrete details of the boosting variant used
are described in Section 3. Section 5 presents experimental results obtained with
this method. Finally, we give some concluding remarks in Section 6.



2 The Base Classifiers 

2.1 Literals Selection 

The base learner works as follows. First, several examples are selected at random
as possible references. The number of reference examples considered (r) is a
parameter, and it would even be possible to use different numbers of positive and
negative examples.

For each reference example, the distance to all the other training examples (e)
is computed. The time necessary for this process is O(e t(n)), where t(n) is the
time necessary for calculating the distance between two examples with n
attributes. In the case of the Euclidean distance, t(n) ∈ O(n).

Then the best threshold for the distances is computed in a similar way to that
used with decision trees. First, all the distances are sorted (time O(e lg e)).
All the values are then considered, from left to right, keeping track of the
number and weight of positive and negative examples to the left of the current
value. For each distance value, the error of selecting it as the threshold is
computed. This can be done, for each value, in O(1), because it only involves
calculating a function of the weight of positive and negative examples to the
left and to the right of the threshold. For e distances the time necessary is
O(e), which is smaller than O(e lg e). Hence, the base learner requires a time of
O(r e (t(n) + lg e)).
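The sort-and-scan threshold search for one reference example can be sketched as follows. The text only says that the criterion is "a function of the weight of positive and negative examples" on each side, so the weighted classification error used here is an assumption, and `best_threshold` is a hypothetical name.

```python
def best_threshold(distances, labels, weights):
    """Return (error, threshold, negated) minimizing the weighted error.

    distances[i] is the distance from example i to the reference example,
    labels[i] is +1 or -1, weights[i] is the current boosting weight.
    The literal "distance <= threshold" predicts the positive class;
    the negated literal predicts the negative class.
    """
    order = sorted(range(len(distances)), key=lambda i: distances[i])  # O(e lg e)
    total_pos = sum(w for w, y in zip(weights, labels) if y > 0)
    total_neg = sum(w for w, y in zip(weights, labels) if y < 0)
    left_pos = left_neg = 0.0
    best = (float("inf"), None, False)
    for k, i in enumerate(order):  # one O(1) update per distance value
        if labels[i] > 0:
            left_pos += weights[i]
        else:
            left_neg += weights[i]
        # Evaluate a threshold only after the last example sharing this distance.
        if k + 1 < len(order) and distances[order[k + 1]] == distances[i]:
            continue
        # Errors: negatives accepted on the left plus positives rejected on the right.
        error = left_neg + (total_pos - left_pos)
        negated_error = (total_pos + total_neg) - error
        for err, negated in ((error, False), (negated_error, True)):
            if err < best[0]:
                best = (err, distances[i], negated)
    return best


print(best_threshold([1.0, 2.0, 3.0, 4.0], [1, 1, -1, -1], [0.25] * 4))
```

Repeating this for each of the r candidate references and keeping the best result gives the overall cost stated above.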

2.2 Assigning Confidences 

Given a literal <distance>_le( Example, Reference, Attributes, Value ), the RBF
selected is 



where x is the Example, c the Reference example, t the threshold Value and d_A
the <distance> restricted to the Attributes A.

This function has the following properties:

- h(c) = 1
- h(x) = 0 if d_A(x, c) = t
- −1 < h(x) ≤ 1, given d_A(x, c) ≥ 0
- The function decreases monotonically as d_A(x, c) increases

If the literal is negated, then the function is multiplied by −1. These functions
are radial basis functions [?], and a linear combination of them is an RBFN.
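To make the listed properties concrete, here is one function that satisfies all four of them. It is an assumption chosen for illustration, h = 2^(1 − (d/t)²) − 1, and is not claimed to be the expression used by the authors; the name `rbf_confidence` is likewise hypothetical.

```python
def rbf_confidence(distance, threshold, negated=False):
    """Illustrative confidence in (-1, 1] computed from a distance and a threshold.

    Assumed form: h = 2 ** (1 - (d / t) ** 2) - 1, with t > 0, so that
    h = 1 at d = 0, h = 0 at d = t, and h tends to -1 as d grows.
    """
    h = 2.0 ** (1.0 - (distance / threshold) ** 2) - 1.0
    return -h if negated else h


for d in (0.0, 1.0, 2.0, 4.0):
    print(d, rbf_confidence(d, threshold=2.0))
```

Any other radially decreasing function with the same values at d = 0 and d = t would serve equally well for this illustration.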

3 Boosting 

The combination of several classifiers into an ensemble is a natural way of
increasing the accuracy with respect to the original classifiers. One of the most
popular methods for creating ensembles is boosting [?], a family of methods of
which AdaBoost is the most prominent member. They work by assigning a weight to
each example. Initially, all the examples have the same weight. In each iteration
a base classifier is constructed according to the distribution of weights.
Afterward, the weight of each example is readjusted, based on the correctness of
the class assigned to the example by the base classifier. The final result is
obtained by a weighted vote of the base classifiers.
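The reweighting step can be sketched as below. The text does not spell out the update rule, so the standard confidence-rated AdaBoost update, in which each weight is multiplied by exp(−α y h(x)) and the result is renormalized, is assumed here; `update_weights` is a hypothetical name.

```python
import math


def update_weights(weights, labels, confidences, alpha):
    """One reweighting step of confidence-rated AdaBoost (assumed standard form).

    labels[i] is +1 or -1 and confidences[i] is h(x_i) in [-1, +1].
    Returns the normalized weights and the normalization factor.
    """
    margins = [y * h for y, h in zip(labels, confidences)]
    unnormalized = [w * math.exp(-alpha * u) for w, u in zip(weights, margins)]
    z = sum(unnormalized)
    return [w / z for w in unnormalized], z


weights = [0.25, 0.25, 0.25, 0.25]
labels = [1, 1, -1, -1]
confidences = [0.9, 0.8, -0.7, 0.6]  # the last example is misclassified
new_weights, z = update_weights(weights, labels, confidences, alpha=1.0)
print(new_weights)
```

Examples classified correctly with high confidence lose weight, while the misclassified one gains weight, which is what drives the next base classifier toward the hard cases.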

The following sections give some details about the version of boosting used 
in this work. 

3.1 Selecting α 

In [?] several methods are proposed for selecting the weight (α) associated with
each base classifier. The best value for α is obtained by minimizing

$$Z = \sum_i D(i)\, e^{-\alpha u_i} \tag{1}$$

where D(i) is the weight of the example x_i, u_i = y_i h(x_i), y_i ∈ {−1, +1} is
the class of the example and h(x_i) is the confidence given by the base classifier
to the example. For the original AdaBoost, this expression is approximated, given
u_i ∈ [−1, +1], by

$$Z \le \sum_i D(i) \left( \frac{1 + u_i}{2}\, e^{-\alpha} + \frac{1 - u_i}{2}\, e^{\alpha} \right)$$

and the α minimizing this expression is selected analytically. Nevertheless,
they suggest that it is possible to select other upper bounds, and we use

$$Z \le \sum_{i:\, u_i \ge 0} D(i) \left( u_i e^{-\alpha} - u_i + 1 \right) + \sum_{i:\, u_i < 0} D(i) \left( -u_i e^{\alpha} + u_i + 1 \right)$$

which gives a tighter approximation.
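A small numerical sketch may help compare the quantities involved. It assumes the tighter bound splits the examples by the sign of u_i, replacing exp(−α u) by its chord on [0, 1] for non-negative u and by its chord on [−1, 0] otherwise, and that the original AdaBoost bound uses the chord on [−1, +1]. Under that reading the tighter bound has a closed-form minimizer. All function names are hypothetical.

```python
import math


def z_exact(alpha, weights, margins):
    """Z = sum_i D(i) * exp(-alpha * u_i)."""
    return sum(d * math.exp(-alpha * u) for d, u in zip(weights, margins))


def z_bound_chord(alpha, weights, margins):
    """Upper bound using the chord of exp(-alpha * u) over [-1, +1]."""
    return sum(d * ((1 + u) / 2 * math.exp(-alpha) + (1 - u) / 2 * math.exp(alpha))
               for d, u in zip(weights, margins))


def z_bound_piecewise(alpha, weights, margins):
    """Upper bound using separate chords over [0, 1] and [-1, 0] (assumed reading)."""
    total = 0.0
    for d, u in zip(weights, margins):
        if u >= 0:
            total += d * (u * math.exp(-alpha) - u + 1)
        else:
            total += d * (-u * math.exp(alpha) + u + 1)
    return total


def alpha_piecewise(weights, margins):
    """Minimizer of the piecewise bound: 0.5 * ln(W_plus / W_minus)."""
    w_plus = sum(d * u for d, u in zip(weights, margins) if u >= 0)
    w_minus = sum(d * -u for d, u in zip(weights, margins) if u < 0)
    if w_plus <= 0 or w_minus <= 0:
        raise ValueError("closed form needs positive weight on both sides")
    return 0.5 * math.log(w_plus / w_minus)


weights = [0.25, 0.25, 0.25, 0.25]
margins = [0.8, 0.5, -0.2, 0.9]
alpha = alpha_piecewise(weights, margins)
print(alpha, z_exact(alpha, weights, margins),
      z_bound_piecewise(alpha, weights, margins),
      z_bound_chord(alpha, weights, margins))
```

The closed form follows from setting the derivative of the piecewise bound with respect to α to zero; it is undefined when no example has a negative margin, which the sketch reports as an error rather than returning an infinite weight.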

3.2 Multiclass Problems 

There are several methods, which also deal with confidences, for extending
AdaBoost to the multiclass case, such as AdaBoost.MH and AdaBoost.MR [?].
Nevertheless, our base classifiers are binary (stumps) and we cannot directly
use the variants that require multiclass base classifiers.

On the other hand, AdaBoost.OC [?] can be used with any weak learner
that can handle binary labeled data. It does not require that the weak learner
handle multilabeled data with high accuracy. The key idea is, in each round
of the boosting algorithm, to select a subset of the set of labels, and to train
the binary weak learner with the examples labeled positive or negative depending
on whether the original label of the example is or is not in the subset. Our
implementation is based on a further variant of AdaBoost.OC, named
AdaBoost.ECC [?], but dealing with confidence-based predictions.

4 Running Example 

This section shows a small example of the working of the method. The “control 
charts” data set has six classes: normal, cyclic, upward, downward, decreasing 
and increasing. The output codes version of the boosting algorithm selects a 
subset of the classes, e.g. decreasing, increasing and upward. The base learner 
selects a literal for discriminating between the classes in the subset and the rest 
of classes (i.e., normal, cyclic and downward). The literal selected is: 

euclidean_le( Example, upward_55, 116.968937 ) 

The argument Attributes is omitted, because all the attributes are used. 

The confidence function, h(x), assigned to this literal is given by the RBF of
Section 2.2. Using this confidence, the boosting algorithm i) calculates the
weight, α, of this base classifier and ii) updates the weights of the examples.
The learning process consists of repeating these steps once per iteration.

The classification of an example x consists of repeating, for each base
classifier, the following steps:

- The distance between the example x and the reference (upward_55) is
  calculated.
- A confidence h(x) is calculated using the RBF of Section 2.2.
- For the classes in the corresponding subset (decreasing, increasing and
  upward), αh(x) is added to their weights. For the other classes, the quantity
  added is −αh(x).

Finally, the class assigned to the example is the one with greater weight. 
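The classification steps above can be sketched as a loop over the base classifiers. The representation of a base classifier as a tuple (alpha, subset, confidence function) and the name `classify` are assumptions made for this illustration, and the two toy classifiers return fixed confidences rather than real RBF outputs.

```python
def classify(x, classes, base_classifiers):
    """Accumulate +alpha*h(x) for classes in the subset and -alpha*h(x) for the rest."""
    scores = {c: 0.0 for c in classes}
    for alpha, subset, confidence in base_classifiers:
        vote = alpha * confidence(x)
        for c in classes:
            scores[c] += vote if c in subset else -vote
    winner = max(scores, key=scores.get)
    return winner, scores


classes = ["normal", "cyclic", "upward"]
base_classifiers = [
    (1.0, {"upward"}, lambda x: 0.5),             # favors the subset {upward}
    (0.5, {"cyclic", "upward"}, lambda x: -0.4),  # votes against its own subset
]
winner, scores = classify(None, classes, base_classifiers)
print(winner, scores)
```

Each base classifier contributes the same magnitude to every class, with only the sign depending on subset membership.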

5 Experimental Validation 

The characteristics of the data sets are summarized in Table 1. The data sets
waveform, waveform with noise [?], CBF (cylinder, bell and funnel) [?] and
control charts [?] were already used in our work on boosting distance literals
[?].



Table 1. Characteristics of the data sets 



Data set        Classes  Examples  Training / Test  Attributes
Waveform           3       5000      300 / 5000        21
Wave + noise       3       5000      300 / 5000        40
CBF                3        798      10-fold CV        128
Control charts     6        600      10-fold CV        60
Auslan            10        200      10-fold CV        8 x 30



Auslan is the Australian sign language, the language of the Australian
deaf community. Instances of the signs were collected using an instrumented
glove [?]. Each example is composed of 8 series. The number of points in each
example is variable, so each series was reduced to 30 points (the series were
divided into 30 segments along the time axis and the mean of each segment was
used as the value for the reduced series).
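The reduction of a variable-length series to a fixed number of points can be sketched as follows. How segment boundaries were chosen when the length is not a multiple of 30 is not stated, so the integer boundaries used here are an assumption, and `reduce_series` is a hypothetical name.

```python
def reduce_series(series, n_segments=30):
    """Replace a series by the means of n_segments consecutive segments."""
    n = len(series)
    if n < n_segments:
        raise ValueError("series is shorter than the number of segments")
    reduced = []
    for k in range(n_segments):
        start = k * n // n_segments        # assumed boundary rule
        end = (k + 1) * n // n_segments
        segment = series[start:end]
        reduced.append(sum(segment) / len(segment))
    return reduced


print(reduce_series([0, 1, 2, 3, 4, 5], n_segments=3))
```

Because the series is at least as long as the number of segments, every segment contains at least one point and no division by zero can occur.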

The experiments were performed using 100 iterations of boosting and 20
reference examples (10 positive, 10 negative) in each iteration. For each data
set, two distances were considered: the classical Euclidean distance and dynamic
time warping (DTW) [?], a distance designed for time series. For the data sets
with a specified partition into training and test examples, the experiments were
repeated 10 times. For the other data sets, 10-fold stratified cross-validation
was used.
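For readers unfamiliar with DTW, a textbook dynamic-programming formulation is sketched below. The variant, local cost and any window constraint used in the experiments are not described in the text, so the unconstrained form with absolute difference as local cost is an assumption.

```python
def dtw(a, b):
    """Classic dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    previous = [inf] * (len(b) + 1)
    previous[0] = 0.0  # cost of aligning two empty prefixes
    for x in a:
        current = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            cost = abs(x - y)
            # Extend the cheapest of: insertion, deletion, or match.
            current[j] = cost + min(previous[j], current[j - 1], previous[j - 1])
        previous = current
    return previous[len(b)]


print(dtw([1, 2, 3], [1, 2, 2, 3]))
print(dtw([0, 0], [1, 1]))
```

Unlike the Euclidean distance, this measure tolerates shifts and stretches along the time axis, which is why the two distances can behave so differently on time series data.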

Table 2 and Figure 1 summarize the results. Comparing the results of Euclidean
and DTW, it is clear that the use of an adequate distance greatly affects the
results of the classifier. The fact that Euclidean is better than DTW for the
waveform variants is due to the definition of these data sets: all the
randomness is in the vertical axis, and none in the horizontal (time) axis.

Comparing these results with our results using boosting literals [?] (which we
then considered rather good) shows a very clear advantage for the new method.
Nevertheless, two issues are relevant:

- The results of that work were obtained for all the data sets using
  cross-validation, whereas in this work we use the specified partition when
  available.
- The number of iterations used in that work was 50, but in each iteration
  as many literals as classes were obtained, whereas in this work each
  iteration produces only one literal (RBF).



Table 2. Experimental results. 



Iterations:               10     20     30     40     50     60     70     80     90    100
Wave            Eucl.  17.17  14.99  14.26  14.00  13.85  13.69  13.65  13.54  13.51  13.50
                DTW    23.13  21.23  20.62  20.06  19.72  19.62  19.36  19.26  19.03  18.97
Wave + noise    Eucl.  20.13  16.09  15.15  14.75  14.46  14.44  14.40  14.29  14.33  14.29
                DTW    26.59  25.26  24.73  24.01  23.69  23.60  23.52  23.15  22.99  22.89
CBF             Eucl.   9.49   8.00   6.89   6.89   6.88   6.50   6.74   6.00   6.00   6.00
                DTW     1.76   0.76   0.38   0.38   0.12   0.12   0.12   0.12   0.12   0.12
Control charts  Eucl.  38.17   8.50   8.50   7.67   7.50   8.00   5.50   6.17   4.83   4.17
                DTW    17.67   3.33   1.17   0.50   0.17   0.17   0.33   0.17   0.17   0.17

Iterations:               30     60     90    120    150    180    210    240    270    300
Auslan          Eucl.  11.00   5.50   3.50   3.50   3.00   3.00   3.00   2.50   2.50   2.50
                DTW     7.50   2.50   0.50   2.00   0.50   1.50   0.50   1.00   0.00   0.00



The Auslan data set was not used in [?]. This is the data set with the highest
number of classes (10). Hence, we increased the number of iterations for this
data set, allowing up to 300 iterations. The result reported in [?] is an error
of 2.50, using event extraction, event clustering and Naive Bayes classifiers.

Finally, an important detail is that we are not aware of better results for these
data sets, even though some of them (the waveform variants) are used very
extensively in the literature. Table 3 shows results, from other authors, for
these data sets.



Table 3. Results of other works for the data sets. 



Data set      Result   Reference  Method
Wave           14.30     [?]      meta decision trees: decision trees, rules learner,
                                  nearest neighbor & naive Bayes
Wave + noise  >16.50     [?]      boosting decision trees
CBF             1.90     [?]      event extraction, event clustering & decision trees
Auslan          2.50     [?]      event extraction, event clustering & Naive Bayes
[Figure 1, four panels: (a) Wave, (b) Wave + noise, (c) CBF, (d) Control charts]

Fig. 1. Graphs of the results for the different data sets. 



6 Discussion and Further Work 

We have presented a novel method for learning RBFNs, based on boosting very
simple classifiers. These classifiers are stumps over new attributes of the
examples (distances to reference examples). Finally, these stumps are converted
to RBFs. Some characteristics of the proposed method are:

- The method is nearly parameter-free. The only clear parameter is the number
  of iterations. Nevertheless, an interesting fact is that the classifiers
  (networks) obtained with a given number of iterations are included (as
  subnetworks) in the ones obtained with more iterations. Hence, it is possible
  i) to select only an initial fragment of the obtained network and ii) to
  continue adding base classifiers (neurons) to a previously obtained classifier.

- There are also other possible parameters, such as the number of reference
  examples considered, but in the experimental validation we fixed them
  arbitrarily and did not try, in any way, to adjust them.

- The selected distance could be considered another parameter. However, first,
  it will normally be the Euclidean distance, except in specific domains such as
  time series, where more appropriate distances exist. Second, it would be a
  parameter of any RBFN learning method capable of dealing with different
  distances.

- Note that with our method it is possible to work with several distances
  simultaneously. Moreover, the same distance can be used restricted to different
  subsets of attributes.

- For the multiclass case, due to the use of AdaBoost.ECC, the weights of
  the connections between the t-th RBF neuron and the output neurons can only
  take two values (i.e. α_t and −α_t). This is clearly a disadvantage with
  respect to other methods, and a clear candidate for improvement.

- Perhaps one of the most distinctive characteristics of this method is that it
  does not use clustering. In this way, the selection of the centers considers
  the weights of the examples in the current iteration; it is influenced by the
  evolution of the process.

  In any case, there are several ways of incorporating clustering into our
  method. One possibility would be to do an initial selection of centers by
  clustering, and to apply the proposed method with the restriction that the only
  possible centers are the ones preselected by clustering. Another would be to do
  clustering in each iteration, probably, for efficiency, on a small subset of
  the examples.

- This method is, currently, very simple. It is based on boosting stumps and
  its implementation is one of the easiest among classification methods. It does
  not use clustering techniques for selecting the centers. The mathematical
  concepts used are fairly simple, e.g. it does not use matrices or gradients in
  any way. There is no feature selection or feature weighting, that is, all the
  attributes are considered with the same weight.

The results obtained for the data sets are clearly very competitive with the
results we know for these data sets. The current method has its origin in a time
series classification system, and the selection of the data sets is biased by
this origin. An open question is to what extent our good results are due to the
use of a specific distance for this domain.

The proposed method is not based on any other method for learning RBFNs,
but it is clear that different combinations of this method with others are
possible.

In [?], a method for constructing hybrid MLP and RBF networks is presented.
An interesting question is how to do, effectively, something similar with
boosting, especially considering that boosting is well suited for combining
different kinds of classifiers, as done in [?] with similarity and interval-based
literals.

The research on boosting has focused on classification, and this work follows
this trend. On the other hand, RBFNs are frequently used for regression. Hence,
it is interesting to consider the construction of regression RBFNs using one of
the variants of AdaBoost for regression, such as AdaBoost.R [?] or ExpLev [?].
Abstract. Multiple classifier methods are effective solutions to difficult 
pattern recognition problems. However, empirical successes and failures 
have not been completely explained. Amid the excitement and confu- 
sion, uncertainty persists in the optimality of method choices for specific 
problems due to strong data dependences of classifier performance. In 
response to this, I propose that further exploration of the methodology 
be guided by detailed descriptions of geometrical characteristics of data 
and classifier models. 



1 Introduction 

Multiple classifier systems are often practical and effective solutions for difficult 
pattern recognition tasks. The idea appeared under many names: hybrid methods, 
decision combination, multiple experts, mixture of experts, classifier ensembles, 
cooperative agents, opinion pool, sensor fusion, and more. In some areas it was 
motivated by an empirical observation that specialized classifiers often excel in 
different cases. In other areas it occurred naturally from the application context, 
such as the need to employ a variety of sensor types which induces a natural 
decomposition of the problem. There were also cases motivated by an attempt 
to escape from the burden of making a commitment to some arbitrary initial 
condition, such as the initial weights for a neural network. There were even hopes 
that any sort of randomness introduced in classifier training would produce a 
diverse collection that could perform better than a single element. 

There are many ways to use more than one classifier in a single recognition 
problem. A divide-and-conquer approach would isolate the types of input on 
which each specific classifier performs well, and direct new input accordingly. 
A sequential approach would use one classifier first, and invoke others only if 
it fails to yield a decision with sufficient confidence. All these can be said to 
be multiple classifier strategies, and have been explored to a certain extent. 
However, motivated by the above mentioned factors, most combination research 
focuses on applying all the available classifiers in parallel to the same input 
and combining their decisions. Naturally one asks, what is gained and lost in a 
parallel combination? When is it preferable to alternative approaches? 

The trend of parallel combination of many classifiers deviates from, or even 
follows an opposite philosophy of, the traditional selection approach in which one 
evaluates the available classifiers against a representative sample and chooses 
the best one to use. Here, in essence, one abandons the attempt to find the best 
classifier, and instead, tries to use all the available ones in a smart way. This 
opposes the wisdom of economical design. But introducing needless classifiers is 
more than just harming efficiency. The agreement of two poor classifiers does not 
necessarily yield more correct decisions. And one can easily fall into a situation 
that the same training data are used to estimate an increasing and potentially 
infinite number of classifier parameters, which is not an unfamiliar trap. 

As the idea prospered, many different proposals of combination emerged. It 
almost feels like we are simply bringing the fundamental pursuit to a different 
level. Instead of looking for the best set of features and the best classifier, now 
we look for the best set of classifiers and then the best combination method. 
One can imagine very soon we will be looking for the best set of combination 
methods and then the best way to use them all ... If we do not take the chance 
to review the fundamental problems arising from this challenge, we are bound 
to be driven into such an infinite recurrence, dragging along more and more 
complicated combination schemes and theories, and gradually losing insight 
into the original problem. 

So, is classifier combination a well justified, systematic methodology, or is it 
a desperate attempt to make the most out of imperfect designs? What exactly 
is gained or lost in a combination effort? What has been achieved and what is 
still missing, and what should we do next? In this paper, I review the proposed 
methods and some of the challenges in related theories and practices, and suggest 
ways to further advance the methodology. 

2 Difficulties in Combination Theories and Practices 

The possibility, by now well supported by empirical evidence, of being able to 
go beyond the power of traditional classifiers is exciting. In pattern recognition, 
early discoveries that the combined accuracy of several classifiers can be better 
than that of each individual came as a surprise from experiments. Later studies 
revealed many alternative methods that can achieve similar effects. Proposed 
methods fall into two categories: (1) assume a given, fixed set of carefully de- 
signed and highly specialized classifiers, attempt to find an optimal combination 
of their decisions; and (2) assume a fixed decision combination function, gener- 
ate a set of mutually complementary, generic classifiers that can be combined 
to achieve optimal accuracy. We will refer to combination strategies of the first 
kind as decision optimization methods and the second kind as coverage optimiza- 
tion methods. It is also possible to apply the decision optimization methods to 
classifiers generated with the aim of coverage optimization. 



Decision Optimization 

Often in pattern recognition practices, several classifiers can be designed for 
the same problem. Combining their decisions gives opportunities of improving 
accuracy or reliability. Choices of decision combination methods are dictated 
by several factors: what type of output the classifiers can produce, how many 
classes there are, and whether sufficient training data are available to tune the 
combination function. Table 1 summarizes the best known decision combination 
methods suitable under two contextual requirements. 



Table 1. Decision combination methods 





                         Resolution of belief scores
Trainable  binary or one      ranked lists of       continuous prob. estimates
           of N decisions     classes               or belief scores
No         majority,          Borda count           sum, product rules
           plurality vote
Yes        weighted vote      logistic regression   Bayes, Dempster-Shafer rules



The idea of employing multiple experts specialized for a given task in differ- 
ent aspects is probably as old as the history of human society. But such common 
wisdom does not necessarily and immediately apply in the context of pattern 
recognition, where concepts such as differences and cooperation have specific 
meanings. Behavior of classifiers can be mathematically characterized, and with 
accuracy being an objective measure of effectiveness, the benefits of any
combination method can potentially be quantified. The reasons for having multiple
sources of knowledge about an input pattern can be various, but whether they
should be maintained in separate representations is never obvious. 

Even less clear is whether one should integrate the separate representations 
and compare them under a single metric, or direct them to separate classifiers 
and defer the integration until after the classifiers have processed them. Regard- 
less of the level where the integration is carried out, details of the integration 
procedure have to be stated in terms of a concise mathematical function and im- 
plemented in a well-defined algorithm. And early explorations show that there 
are vast differences in the effectiveness of different combination procedures. 

Because of different contextual requirements, not all methods can be used
with all problems. General performance claims about a particular combination
strategy are thus difficult to make. Evaluation of the methods is further
complicated by the fact that by and large only successful experiments are
published, and it is difficult to find the limits of a method’s applicability.
Nevertheless, it is obvious that very little can be tuned in the simple voting
schemes, and lacking sufficient training data, this may well be about all that
can be done. With sophisticated output like estimates of posterior probabilities,
other than the simple sum, product rules or Bayes schemes, more elaborate
combination-level classifiers can be applied. However, for problems with a large
number of classes, the availability of good estimates of posteriors is a very
strong assumption. Without sufficient training data, the estimates given by the
individual classifiers are inaccurate, and so are the estimates at the
combination level. Thus, applying overly sophisticated combination methods is a
dangerous game. The rank combining schemes weaken the requirement to only
preferences, which are always available as long as the classifiers compute any
numerical similarity measure. For a very large number of classes and mixed type
classifiers, this is an interesting middle ground. But the linear and uniform
scale of the ranks may be too crude an approximation, and to simplify the model,
the combinations may have to be restricted to only a small number of top ranks.
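Two of the untrained schemes mentioned here, plurality voting and the Borda count, are simple enough to sketch directly. The function names, the tie-breaking behavior (first label encountered wins) and the optional `top` cutoff for restricting the count to the leading ranks are assumptions made for this illustration.

```python
from collections import Counter


def plurality_vote(decisions):
    """Return the label chosen by the largest number of classifiers."""
    return Counter(decisions).most_common(1)[0][0]


def borda_count(rankings, top=None):
    """Combine ranked lists (best first) by summing rank-based points.

    A label at position p in a list of n labels receives n - 1 - p points.
    If top is given, only the first `top` entries of each list are counted.
    """
    scores = Counter()
    for ranking in rankings:
        n = len(ranking)
        considered = ranking if top is None else ranking[:top]
        for position, label in enumerate(considered):
            scores[label] += n - 1 - position
    return scores.most_common(1)[0][0], dict(scores)


rankings = [["a", "b", "c"], ["b", "a", "c"], ["b", "c", "a"]]
print(plurality_vote([r[0] for r in rankings]))
print(borda_count(rankings))
```

The point scale is linear and identical for every classifier, which is exactly the uniformity the paragraph above describes as a possibly crude approximation.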

Hopes for a better understanding are placed on the development of good 
theories. But on the theory side, several problems persist. Most theoretical works 
suffer from failures to model various details in a classification problem. 

In the behavioral sciences, methods for combining votes and rankings of
alternative choices are referred to as social choice functions or group consensus
functions. There, the focus is on obtaining a combined choice that best represents the 
voters’ preferences. There is no notion of absolute correctness of the combined 
choice. The merit of a candidate in an election is solely determined by some spe- 
cific characterization of voters’ preferences. However, in classification, there is a 
true class associated with each input that is determined regardless of the combi- 
nation mechanism. That makes a difference, since the combination function can 
be trained to optimize some objective accuracy measure. In classification, prior 
performance of the voters can also be evaluated against the objective truth, and 
based on such evaluation the combination function can be tuned. Some combi- 
nation schemes such as regression make explicit use of these evaluations. Others 
that do not use such information carry an implicit assumption that the voters 
are competent to a certain extent and that the imposed characterization of the 
voters’ preference is reasonable, either of which may not be correct. 

Moreover, in the context of pattern recognition, the individual decisions of 
the voters are never independent. They are intrinsically linked by the fact that 
they are responses to the same input pattern. The degree of agreement of the 
individual decisions on the same input case, besides characterizing the amount 
of differences among the classifiers, is also a reflection of the relative difficulty 
of the case. These two effects must be modeled separately. 

For combining estimates of posterior probabilities or belief scores represented
on a normalized, continuous scale, Bayes decision theory and Dempster-Shafer’s
theory of evidence have dominated. Prior performances of individual classifiers
can be embedded into the combination function in the form of estimates of
correctness probabilities conditioned on the individual decisions. Simpler
combination rules that do not take into account the classifiers’ prior
performance were studied in [?], where a justification was given in support of
the sum rule, which chooses the class maximizing the sum of individual estimates
of posterior probabilities. The justification comes from the sum rule’s relative
insensitivity to local estimation errors when compared to the product rule. There
are also other explored methods that essentially treat the belief scores given by
the individuals as input features for a classifier at another level. These derive
support from the classification principles. In statistics they are known as
model-mix methods [?], justified by reducing the bias of the combined estimator.
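The difference in sensitivity between the sum and product rules can be seen on a toy case. The posterior estimates below are invented for illustration, and the function names are hypothetical; the sketch assumes every classifier reports an estimate for every class.

```python
import math


def sum_rule(posteriors):
    """Choose the class maximizing the sum of the classifiers' posterior estimates."""
    classes = posteriors[0].keys()
    return max(classes, key=lambda c: sum(p[c] for p in posteriors))


def product_rule(posteriors):
    """Choose the class maximizing the product of the classifiers' posterior estimates."""
    classes = posteriors[0].keys()
    return max(classes, key=lambda c: math.prod(p[c] for p in posteriors))


# Two classifiers strongly favor "a"; a third gives "a" an estimate of zero.
estimates = [
    {"a": 0.9, "b": 0.1},
    {"a": 0.9, "b": 0.1},
    {"a": 0.0, "b": 1.0},
]
print(sum_rule(estimates), product_rule(estimates))
```

A single estimate of zero vetoes a class under the product rule regardless of the other classifiers, whereas under the sum rule it only lowers the total.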

Many attempts at modeling the performance of such combination schemes
use some notion of complementariness among the component classifiers. But the
precise definition of it is seldom given. A very common assumption is that the
classifiers’ decisions are statistically independent, in the sense that the
probability of a joint decision equals the product of the probabilities of each
individual classifier’s corresponding decision. But this is an imposed, very
strong assumption which could be very far from the truth. In a recognition
system, classifier decisions are intrinsically correlated, as they respond to the
same input. The correlation among the classifiers has to be measured from the
data. Until measurements confirm the assumption of zero correlation, the theories
that depend on it do not necessarily apply. This fact is well recognized at the
level of feature representations, but is often neglected at the level of
classifier decisions. Moreover, the correlation among different classifiers’
decisions can vary from input to input. Decisions may be strongly correlated only
on easy cases far from the class boundary. So even with the same set of
classifiers, the correlation of their decisions varies across problems and
subproblems according to the proportions of easy and hard cases. It is a
theoretical challenge to model such detailed data dependency.
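One simple way to check the independence assumption on data is to compare the observed rate at which two classifiers are both correct with the product of their individual accuracies. The sketch below does this for correctness indicators; the function name and the toy data are assumptions made for illustration.

```python
def joint_vs_independent(correct_a, correct_b):
    """Return (observed joint accuracy, accuracy product expected under independence).

    correct_a[i] and correct_b[i] are truthy when the respective classifier
    decided input i correctly.
    """
    n = len(correct_a)
    p_a = sum(1 for a in correct_a if a) / n
    p_b = sum(1 for b in correct_b if b) / n
    p_joint = sum(1 for a, b in zip(correct_a, correct_b) if a and b) / n
    return p_joint, p_a * p_b


# Both classifiers succeed on the same three easy inputs and fail on the hard one.
correct_a = [True, True, True, False]
correct_b = [True, True, True, False]
print(joint_vs_independent(correct_a, correct_b))
```

A gap between the two numbers indicates that the decisions are correlated. Computing the same quantities separately on subsets of easy and hard inputs would show how the correlation varies with the input, as the paragraph above notes.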



Coverage Optimization 

A system using several classifiers may not be able to achieve the highest accuracy 
for a problem if there are cases for which none of the classifiers’ decisions is 
sufficiently close to being correct. Coverage optimization methods are called 
on to pursue the missing guarantee. There, the strategy is to create a set of 
classifiers, observing some specific notion of complementariness, such that they 
can yield a good final decision under the chosen combination function. Table 2 
summarizes the better known coverage optimization methods and the training 
mechanism used to introduce complementariness between the components. 



Table 2. Coverage optimization methods 



Method                         Training mechanism for introducing complementariness
perturbation                   vary initial conditions or parameters of training process
stacking                       train classifiers by nonoverlapping subsamples of training set
bagging                        resample the training set by bootstrap replicates
boosting                       resample the training set by weights evolving with accuracy
random subspaces               project training set to randomly chosen subspaces
stochastic discrimination      generate random kernels to measure coverage of training set
error correction output coding force training on partial decision boundaries
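Two of the mechanisms in the table, bootstrap replicates (bagging) and random subspaces, amount to small resampling routines. The sketch below is a minimal illustration; the function names are hypothetical and the training of the component classifiers themselves is left out.

```python
import random


def bootstrap_replicate(training_set, rng):
    """Draw len(training_set) examples with replacement, as in bagging."""
    n = len(training_set)
    return [training_set[rng.randrange(n)] for _ in range(n)]


def random_subspace(n_features, k, rng):
    """Choose k distinct feature indices at random."""
    return sorted(rng.sample(range(n_features), k))


def project(example, features):
    """Keep only the selected features of one example."""
    return [example[f] for f in features]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
data = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [9.0, 10.0, 11.0, 12.0]]
replicate = bootstrap_replicate(data, rng)
features = random_subspace(4, 2, rng)
print(replicate, features, [project(x, features) for x in data])
```

In both cases each component classifier would be trained on its own replicate or projection, so that differences among the components come from the sampling rather than from the learning algorithm.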



Several methods introduce complementary strengths by training component
classifiers on different subsamples of the training set. Despite many observed
empirical successes, such training set subsampling methods are paradoxical.
Weakening the individual classifiers by not training on, or not equally
weighting, all available data is said to help avoid overfitting. At an extreme, boosting cannot run 
on classifiers perfect for the training data, as there are no errors to train addi- 
tional components. But the design is intended to make the entire system work 
well on the full training set. So do we want it or not that the classifiers perfectly 
adapt to the training data? If we want it, what is the point of deferring it to the 
level of decision combination? If we do not want it, what is the point of adding 
more and more components to improve accuracy on the full training set? Why 
should the full system treat the training set differently from the way followed 
by the component classifiers? If the training set is assumed to represent well the 
unseen cases, why would one believe that by sacrificing training set accuracy one 
can gain in testing set accuracy? If the feature space is small and the training 
sample is dense, the training set could overlap with the testing set perfectly. In 
such a case, what good will it do to deliberately sacrifice accuracy on the train- 
ing set? On the other hand, without involving the generalization performance 
in the analysis, the argument that the methods can, eventually, do perfectly on 
the training set is useless - template matching can do the job, there is no reason 
to bother with such elaborate training procedures and sacrifices. Without a 
thorough understanding of how overfitting is avoided or controlled within the 
training process, there is no guarantee on the results, and empirical evidence 
does show that these methods do not always work. 

Then there is the question of the form of the component classifiers. All these 
methods are known to work well with decision trees, though details matter in 
the specific way data are split at each internal node. The much used notion of 
weak classifiers is not well defined. Fully split decision trees are very strong 
classifiers, though pruned or forced shallow versions have been used with some 
success. With others, like linear discriminators, things are less clear. If the component 
classifiers are too weak, given the simple decision combination function involving 
weighted or unweighted averaging, the decisions of many bad classifiers can eas- 
ily outweigh the good ones, especially in methods like boosting that focus more 
and more on the errors. And how about mixing in different types of classifiers? 
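
The risk that many bad classifiers outweigh the good ones can be made concrete with a small calculation. The sketch below assumes a two-class problem, an unweighted majority vote, and component classifiers that err independently with the same accuracy `p`; none of these assumptions comes from the text, and real components are rarely independent.

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that a majority of n independent voters is correct.

    Assumes two classes, odd n, and voters that are each correct with
    probability p independently of one another (a strong assumption).
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))

for p in (0.45, 0.55):
    print(p, [round(majority_vote_accuracy(p, n), 3) for n in (1, 11, 101)])
```

Under these assumptions, adding voters helps when each is better than chance and hurts when each is worse than chance, which is why the strength of the components cannot be ignored.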

Such fundamental issues are in the midst of confusion in several communities. 
Kleinberg offered a rigorous analysis of these issues, using a set-theoretic 
abstraction to remove all the algorithmic details of classifiers, feature extractors, 
and training procedures. It considers only the classifiers’ decision regions in the 
form of point sets, called weak models, in the feature space. A collection of 
classifiers is thus just a sample into the power set of the feature space. If the 
sample satisfies a uniformity condition, i.e., if its coverage is unbiased to any 
local region of the feature space, then a symmetry is observed between two 
probabilities (w.r.t. the feature space and w.r.t. the power set respectively) of 
the same event that a point of a particular class is covered by a component of the 
sample. Then it is shown that discrimination between classes is possible as long 
as there is some minimum difference in each component’s inclusion of points 
of different classes, which is trivial to satisfy. The symmetry translates such 
differences across different points in the space to differences among the models 
on a single point. Accuracy in classification is then governed by the law of large 
numbers. If the sample of weak models is large, the discriminant function, defined 
on the coverage of the models on a single point and the class-specific differences 
within each model, converges to poles distinct by class with diminishing variance. 
Moreover, it is proved that the combined system maintains the projectability of 
the weak models, i.e., if each model is thick enough with respect to the spatial 
continuity of the classes, estimates of the point inclusion probabilities from the 
training set are close to the true probabilities, then the combined system would 
retain the same goodness of approximation of the estimate, which will translate, 
by the symmetry, to accuracy in classifying unseen points. 
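
The convergence of the discriminant can be illustrated on a toy finite feature space. The sketch below uses one commonly stated form of the per-model term, (membership of the point in the model minus the model's coverage of class B) divided by (coverage of class A minus coverage of class B), averaged over many models. The 20-point space, the coin-flip generator of weak models, the parameter `min_gap`, and all names are assumptions for illustration; the example only evaluates training points and says nothing about projectability.

```python
import random

def sd_discriminant(points_a, points_b, space, n_models, min_gap, rng):
    """Average a Kleinberg-style discriminant over random weak models.

    A weak model is a random subset of `space`. A model is kept only if
    its coverage counts of the two classes differ by at least `min_gap`
    points (the required minimum difference between classes).
    """
    totals = {q: 0.0 for q in points_a + points_b}
    kept = 0
    while kept < n_models:
        model = {x for x in space if rng.random() < 0.5}
        cover_a = sum(q in model for q in points_a)
        cover_b = sum(q in model for q in points_b)
        if abs(cover_a - cover_b) < min_gap:
            continue  # too little difference between the classes
        p_a = cover_a / len(points_a)
        p_b = cover_b / len(points_b)
        for q in totals:
            totals[q] += ((q in model) - p_b) / (p_a - p_b)
        kept += 1
    return {q: total / n_models for q, total in totals.items()}

rng = random.Random(1)  # fixed seed only for repeatability
class_a = list(range(0, 10))
class_b = list(range(10, 20))
scores = sd_discriminant(class_a, class_b, range(20), 5000, 2, rng)
```

With many models the averaged score of each class A point settles near one and that of each class B point near zero, so thresholding at one half separates the two classes in this toy setting.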

The theory includes explicitly each of the three elements long believed to be 
important in pattern recognition: discrimination power, complementary informa- 
tion, and generalization ability. Here projectability of the weak models is a key 
element in the proof, and not an implicit side effect, as in several other theories 
that attempt to explain the behavior of the coverage optimization methods. What 
is good about building the classifier on weak models instead of strong models? 
Weak models are easier to obtain, and their smaller capacity makes them less 
subject to sampling errors carried in small training sets. Why are 
many models needed? Because the method relies on the law of large numbers 
to reduce the variance of the discriminant on each single point. The uniformity 
condition specifies exactly what kind of correlation is needed among the indi- 
vidual models. Moreover, accuracy is not achieved by intentionally limiting the 
VC dimension of the complete system - the combination of many weak models 
can have very large VC dimension. It is a consequence of the symmetry relating 
probabilities in the two spaces, and the law of large numbers. It is a structural 
property of the topology. The theory succeeded in offering a complete explana- 
tion of the combined behavior of such simple classifiers. However, much remains 
to be explored in using it to predict the behavior of fewer but more sophisticated 
classifiers. 

Combination theories must deal with the dilemma of choosing between a 
probabilistic view and a geometrical view of classification, or the difficulty of 
blending the two. Many theories model a classifier’s decision as a probabilistic 
event, and assume that decisions on each input are not related to decisions on 
others. However, in most application contexts, there is some geometrical conti- 
nuity in the feature space, such that classifiers’ decisions on neighboring points 
are likely to be similar. Some classifiers, such as decision trees, rely explicitly on 
this fact to partition the feature space into contiguous regions of the same class. 
But the notion of neighborhood is explicitly used only in a few theories such as 
stochastic discrimination and consistent systems of inequalities. Discussions 
on the optimal size of component decision trees barely touch on this, but are not 
followed through. Precise characterization of the problem geometry will involve 
descriptions of the fragmentation of the Bayes optimal decision regions, global 
or local linear separability, and convexity and smoothness of boundaries. Many 
of these depend on the properties of a specific metric based on which the classifier 
operates. Better modeling of the geometrical behavior of classifiers is attempted 
in some neural network studies, where classifier training is seen as finding a good 
approximation to the desired decision boundary, but integration of such models 
with the probabilistic view is not complete. 

The probabilistic view comes in because of the need to study a problem by 
random sampling due to the unavailability of complete data coverage. Then there 
is the issue of sampling density and training sample representativeness, which 
is intimately related to the classifier’s generalization ability, and in turn to its 
accuracy. If the training sample densely covers all the relevant regions in the 
feature space, many classifiers, as long as they are trainable and expandable in 
capacity, will work perfectly, and then the competition among methods is more 
on representation and execution efficiency. Say, a decision tree may be preferable 
to nearest neighbors for efficiency reasons. So the difficulty of classification is 
mostly with sparse samples - and for this reason, all theories depending on 
assumptions of infinite samples are useless. Those relying on a vague definition 
of representativeness of the training samples are not much better, as quite typically 
such representativeness is not even parameterized by the sampling size relative 
to the size of the underlying problem. 

There are vast differences due to sampling density. Consider a space where 
each point is randomly labeled as one of two classes. Whereas a dense sample 
may reveal the randomness to some extent, a two-point sample may suggest that 
the problem is linearly separable. With other less radical problems, sampling 
density affects the exact ways that isolated subclasses become connected and 
boundaries are constrained, much more than what can be captured in a collective 
description by a single count of points. Such problems can occur regardless of 
the dimensionality of the feature space, though they are more apparent in high- 
dimensional spaces where the decision boundary can vary with a larger degree 
of freedom. Empirical observations suggested strongly 
that shortage of samples would ruin most promises from the classical approaches. 
This fact was addressed in many studies on error rate estimation as well. 
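
The two-point example can be checked by enumeration. The sketch below is restricted, as an assumption, to points on a line, where a labeled sample is separable by a single threshold exactly when the labels in sorted point order change at most once. It counts what fraction of all possible labelings of n points has that property.

```python
from itertools import product

def threshold_separable(labels):
    """True if labels, listed in sorted point order, change at most once."""
    changes = sum(a != b for a, b in zip(labels, labels[1:]))
    return changes <= 1

def separable_fraction(n):
    """Fraction of all 2**n labelings of n points on a line that a single
    threshold can separate."""
    labelings = list(product((0, 1), repeat=n))
    return sum(threshold_separable(l) for l in labelings) / len(labelings)

for n in (2, 5, 10):
    print(n, separable_fraction(n))
```

Every labeling of two points is separable, while the fraction shrinks rapidly as the sample grows, so only a denser sample can expose a random labeling for what it is.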

Vapnik’s capacity theory is among the first that directly faces the 
reality of small sample effects. It provides a link between the interacting effects 
of classifier geometry and training sample size. But the VC dimension theory 
is not constructive. It gives only a loose, worst case bound on the error rate 
given the geometrical characteristics of the classifier and the sample size. The 
difficulty in tightening the bounds comes from the distribution-free arguments. 
Nevertheless, as we have seen in the theory of stochastic discrimination, 
with a different characterization of training set representativeness, it is possible 
to show tighter error bounds without involving the VC dimension argument. 
Also, by using specific geometrical models matched to the problem, it is possible 
to overcome the infamous curse of dimensionality. 

The theory is difficult because these factors interact with each other. With 
regard to the problem geometry, the classifier geometry, and the sampling and 
training processes, what exactly do we mean by saying that two or more clas- 
sifiers are independent? How about other related concepts such as correlation, 
diversity, collinearity, coincidence, and equalization? What do they mean in each 
context where the decisions are represented as one out of many classes, as 
permutations of class preferences, or as continuous belief scores that are not necessarily 
probability estimates? Kleinberg’s notion of uniformity offers a rigorous defini- 
tion under the set-theoretic abstraction. How can this be generalized to other 
models of classifier decisions? The bias/variance decomposition gives another 
way to relate geometry and probabilities, and has been used in analyzing decision 
boundaries of combined systems. However, many analyses are problematic 
due to inadequate assumptions on decision independence. 

How can one relate local, point-wise measures of classification error and their 
correlation to collective measures over the entire training set? Say, if we observe 
two classifiers agreeing 99% of the time and differing for the rest, how much can 
we infer about the overall similarity of their decisions? And how likely is the 
agreement observed in a different problem? If all the agreed cases are correct 
decisions, simply because those cases are easy for both classifiers, can we tell 
anything about whether the classifiers decide by the same mechanism? Detailed 
studies on the patterns of correlation among the classifiers are necessary to 
answer these questions. 
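
The gap between a collective agreement rate and what happens on the difficult cases can be shown with a small tally. The function name, the returned fields, and the 100-case data below are hypothetical and chosen only to illustrate the point.

```python
def agreement_profile(truth, pred_a, pred_b):
    """Summarize how two classifiers agree, overall and on the hard cases."""
    n = len(truth)
    agree = sum(a == b for a, b in zip(pred_a, pred_b))
    # Hard cases: at least one of the two classifiers is wrong
    hard = [(a, b) for t, a, b in zip(truth, pred_a, pred_b) if a != t or b != t]
    agree_hard = sum(a == b for a, b in hard)
    return {
        "overall_agreement": agree / n,
        "n_hard": len(hard),
        "agreement_on_hard": agree_hard / len(hard) if hard else None,
    }

# 98 easy cases that both classifiers get right, plus 2 hard cases
truth  = [0] * 98 + [1, 1]
pred_a = [0] * 98 + [1, 0]
pred_b = [0] * 98 + [0, 0]
profile = agreement_profile(truth, pred_a, pred_b)
```

Here the two classifiers agree on 99 of 100 cases, yet on the two cases where an error occurs they agree only once, so the overall figure says little about whether they decide by the same mechanism.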

As one compares different approaches of combination, and considers combi- 
nation of combinations, there are a few more intriguing questions to ask: 

— If one defers the final decision and uses the output of individual classifiers 
just as some scores describing the input, are those scores different in nature 
from the feature measurements on the input? Are there intrinsic differences 
between the mappings from the input to the representations given by a 
classifier or by a feature extractor? 

— Is combination a necessity or a convenience? Is there some complementari- 
ness intrinsic to certain classifier training processes? Or is it just an easy 
way to derive a desired decision boundary? 

— Are there any commonalities among all the combination approaches? If many 
of them are found to be similarly effective, are they essentially doing the same 
thing despite superficial differences? 

— Does the hierarchy of combinations converge to a limit upon which one 
would have exhausted the knowledge contained in the training set, such that 
no further improvement in accuracy is possible? 



3 Precise Characterization of Data Dependences of 
Performance 

Many of the above questions are there because we do not yet have a detailed, 
scientific understanding of the classifier combination mechanisms. Theoretical 
explanations are often incomplete, or they have to stop at a level where the 
combinatorics defy detailed modeling. Many studies attempt to analyze classifier 
behavior for all possible problems and data distributions, which results inevitably 
in very weak performance bounds. On the other hand, empirical evidence is 
often very specific to particular applications, and is seldom systematically 
documented. Attempted comparative studies often stop on arriving at some 
collective accuracy measures. But trying a method on 100 published problems and 
counting how many it wins on does not mean much, because these problems may 
all be very similar in certain aspects and may not be typical in reality. On the 
other hand, we will never have a fair sample of realistic problems because of the 
ill-definition of the set. So what can we do? 

If we have a way to characterize the problems in good relevance to classifier 
accuracy, we may hope to find certain rules relating those characteristics to the 
behavior of classifiers or systems generated and combined in a specific way. 
Empirical observations of such relationships may point to opportunities for detailed 
analysis of underlying reasons. 

Here I am advocating a realist’s approach where selection of a classifier or a 
combination method is guided by characteristics of the data. And the data char- 
acteristics must include effects of the problem geometry and sampling density. 
Statements like “method X is of no help when the training sample is large ...” 
are overly simplifying. How large is large? An absolute number on the sample 
size means little without knowledge of the length of class boundary. 

We need much more systematic ways to characterize the problems. We need 
a language to describe the problems in aspects more relevant to the actions of 
classifiers: i.e. not merely collective descriptors such as number of classes, number 
of samples, number of dimensions, etc. We need a better understanding of the 
geometry and topology of point sets in high dimensional spaces, preservation 
of such characteristics under feature transformations and sampling processes, 
and their interaction with the primitive geometrical models used in known clas- 
sifiers. We need to measure or estimate the length and curvature of the class 
boundaries, fragmentation of the decision regions in terms of existence, size, 
and connectedness of subclasses, and the stability of these characteristics with 
respect to changes in sampling density. Some recent attempts are interesting 
starting points. 



Characterization of Data Complexity 

In reality, most practical classification problems arise from nonchaotic processes 
many of which can be described by an underlying physical model. Though the 
models may contain a stochastic component, there should still exist a significant 
structure in the resulting class distributions that differs from a random labeling 
of points. An analysis of such differences will provide us with a framework in 
which one can study the behavior of specific classifiers and combination methods. 

Structured data differ from random labeling in that with random labeling 
there is no geometrical continuity or regularity based on which inferences can be 
made about labels of unseen points from the same source. On the other hand, 
automatic classification methodologies are based on the assumption that such 
learning is possible, to various degrees of difficulty. 

Obviously one practical measure of problem difficulty is the error rate of a 
chosen classifier. However, since our eventual goal is to study behavior of 
classifiers, other measures should also be explored that are independent of such 
choices. Moreover, a problem can be difficult for different reasons. Certain prob- 
lems are known to have nonzero Bayes error. There the classes are ambiguous ei- 
ther intrinsically or due to inadequate feature measurements. Others may have a 
complex class boundary and/or subclass structures. Sometimes high dimension- 
ality of the feature space and sparseness of available samples are to be blamed. 

Among the different reasons, the geometrical complexity of the class bound- 
aries is probably most ready for detailed investigation. One can choose the class 
boundary to be the simplest (of minimum measure in the feature space) decision 
boundary that minimizes Bayes error. With a complete sample, the class bound- 
ary can be characterized by its Kolmogorov complexity. A problem is complex if 
it takes a long algorithm (possibly including an enumeration of all the points and 
their labels) to describe the class boundary. This aspect of difficulty is due to 
the nature of the problem and is unrelated to the sampling process. Kolmogorov 
complexity describes the absolute amount of information in a dataset, and is 
not algorithmically computable. By geometrical complexity one focuses on de- 
scriptions of regularities and irregularities contained in the dataset in terms of 
geometrical primitives. This would be sufficient for pattern recognition where 
classifiers can also be characterized by similar geometrical terms. 

An incomplete or sparse sample adds another layer of difficulty to a discrim- 
ination problem, since an unseen point in the vicinity of some training points 
may share their class labels according to different generalization rules. In real 
world situations, often a problem becomes difficult because of a mixture of these 
two effects. Sampling density is more critical for an intrinsically complex prob- 
lem than an intrinsically simple problem (e.g. a linearly separable problem with 
wide margins). If the sample is too sparse, an intrinsically complex problem may 
appear deceptively simple. Thus, lacking a complete sample, such measures 
of the problem complexity have to be qualified by the representativeness of the 
training set. 

Several measures of geometrical complexity were studied in earlier work, where it 
is shown that many real-world problems occupy a continuum between two ex- 
tremes given by random labelings (the most difficult problems) and linearly 
separable problems (the easiest). Measures known to be useful fall into several 
groups, characterizing the linearity of decision boundaries, inter- and intra- class 
point proximity, overlap of class-specific convex hulls and their projections into 
subspaces, and existence and connectivity of subclasses. In addition, interesting 
characteristics of the datasets were observed from the behavior of two primitive 
classifiers, i.e., nearest neighbor and linear discriminant (obtained by linear 
programming minimizing sum of distances of error points to the separating hy- 
perplane). Their error rates and measures of the intersection of their error points 
with the class-specific convex hulls give some hints to the problem geometry. 
Table 3 summarizes the definitions of these measures and the effects described by 
each. Details can be found in the cited studies. 
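
One of these measures, the fraction of points lying on between-class edges of a minimum spanning tree, can be sketched in a few lines. The implementation below is an assumption-laden illustration: it uses Euclidean distance and a simple quadratic-time version of Prim's algorithm, and when several spanning trees tie for minimum length the value may depend on which one is built.

```python
def boundary_fraction(points, labels):
    """Fraction of points incident to a between-class edge of a Euclidean MST."""
    n = len(points)

    def dist2(i, j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j]))

    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest squared distance to the tree so far
    parent = [-1] * n
    best[0] = 0.0
    on_boundary = set()
    for _ in range(n):
        # Prim's algorithm: add the closest point not yet in the tree
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0 and labels[u] != labels[parent[u]]:
            on_boundary.update((u, parent[u]))
        for v in range(n):
            if not in_tree[v]:
                d = dist2(u, v)
                if d < best[v]:
                    best[v], parent[v] = d, u
    return len(on_boundary) / n

points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
compact = boundary_fraction(points, [0, 0, 0, 1, 1, 1])
interleaved = boundary_fraction(points, [0, 1, 0, 1, 0, 1])
```

Two compact, well separated classes yield a small fraction, while interleaved labels put every point on the boundary, which is the behavior expected of a random labeling.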

Table 3. Measures of geometrical complexity

Relevant effects                     Measures
-----------------------------------  ------------------------------------------------------------
clustering and sphericity            % points with associated adherence subsets retained
inter versus intra class proximity   % points on boundary expressed as between-class edges
                                       in a minimum spanning tree
                                     error rate of nearest neighbor classifier
overlap of classes                   maximum Fisher’s discriminant ratio
                                     maximum (individual) feature efficiency
                                     ratio of average intra/inter class nearest neighbor distance
                                     volume of overlap region
linear separability                  LP minimized sum of error distances
                                     error rate of linear classifier by LP
curvature of boundary                nonlinearity of linear classifier by LP
                                     nonlinearity of nearest neighbor classifier

Some of these measures are correlated and can be indications of more 
fundamental geometrical or topological characteristics. Furthermore, sensitivity of 
these measures with respect to the sampling density is also important. This can 
potentially be estimated by repeating the measurements on subsamples of the 
given data set. 
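
Repeating a measurement on subsamples might look like the following sketch, which uses the maximum Fisher's discriminant ratio as the measure. The per-feature form (squared difference of class means over the sum of class variances), the use of population variances, the subsampling fraction, and the small data set are all assumptions for illustration.

```python
import random
from statistics import mean, pvariance

def max_fisher_ratio(class_a, class_b):
    """Largest per-feature ratio (mean difference)^2 / (sum of variances)."""
    ratios = []
    for fa, fb in zip(zip(*class_a), zip(*class_b)):
        spread = pvariance(fa) + pvariance(fb)
        gap = (mean(fa) - mean(fb)) ** 2
        if spread > 0:
            ratios.append(gap / spread)
        else:
            ratios.append(float("inf") if gap > 0 else 0.0)
    return max(ratios)

def subsample_values(class_a, class_b, fraction, repeats, rng):
    """Recompute the measure on random subsamples of each class."""
    values = []
    for _ in range(repeats):
        sub_a = rng.sample(class_a, max(2, int(fraction * len(class_a))))
        sub_b = rng.sample(class_b, max(2, int(fraction * len(class_b))))
        values.append(max_fisher_ratio(sub_a, sub_b))
    return values

class_a = [(0.0, 0.0), (1.0, 5.0), (2.0, 1.0), (3.0, 4.0)]
class_b = [(10.0, 1.0), (11.0, 4.0), (12.0, 0.0), (13.0, 5.0)]
full_value = max_fisher_ratio(class_a, class_b)
values = subsample_values(class_a, class_b, 0.5, 5, random.Random(0))
```

The spread of `values` around `full_value` gives a rough indication of how stable the measure is when the sample becomes sparser.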



Coupling between Data Characteristics and Classifier Models 

Once we find ways to characterize the problems, we can then ask the question: 
what type of problems does a particular classifier or combination method work 
for? 

To investigate the optimality of match between a classifier or combination 
method and a given problem, one needs a detailed characterization of the 
geometrical structure of the decision regions given by the classifier, and the 
modifications introduced by the combination method. For example, decision trees 
splitting on single features divide the feature space into a set of hyper-rectangular 
cells within each of which the classification remains invariant. Voting of two such 
trees further segments those cells into the cross-product of the two sets. How many of 
the boundary faces of neighboring cells are close to the class boundary depends 
on the geometry of the problem - whether the problem has axis-parallel, flat, or 
curved boundary surfaces, and disconnected subclasses, and to what extent. 
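
The cross-product of cells can be seen in a two-feature toy case. In this sketch each tree is reduced, as a simplifying assumption, to a list of split thresholds on a single feature; the thresholds and the evaluation grid are hypothetical.

```python
from bisect import bisect_right

def cell_index(value, thresholds):
    """Index of the interval that `value` falls in, given sorted thresholds."""
    return bisect_right(thresholds, value)

# Two hypothetical single-feature trees, each reduced to its split thresholds
tree_x_splits = [0.3, 0.7]   # tree 1 splits only on feature x
tree_y_splits = [0.5]        # tree 2 splits only on feature y

def combined_cell(point):
    """Cell of the two-tree system: the pair of cells from each tree."""
    x, y = point
    return (cell_index(x, tree_x_splits), cell_index(y, tree_y_splits))

grid = [(i / 10 + 0.05, j / 10 + 0.05) for i in range(10) for j in range(10)]
cells_tree_x = {cell_index(x, tree_x_splits) for x, _ in grid}
cells_tree_y = {cell_index(y, tree_y_splits) for _, y in grid}
cells_both = {combined_cell(p) for p in grid}
```

Tree 1 alone produces three cells and tree 2 alone produces two, while the pair produces six, so the combined system can place boundary faces where neither tree could on its own.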

An example is given in an earlier study where the performances of two decision forest 
constructors, namely, bagging and random subspaces, were studied along with 
several measures of problem geometry. Strong correlation was observed between 
the classifier accuracies and a measure of length of class boundaries as well as a 
measure of the thickness of class manifolds. Follow-up studies showed that both 
types of forests are capable of improving over a single tree for problems of various 
levels of complexity, i.e., improvements are observed in problems with very low, 
very high, and all intermediate single tree error rates. Neither works well when 
several conditions occur jointly: a very high fraction of boundary points (70% or 
above), a close to 1 ratio of intra/inter class nearest neighbor distances, a very 
low maximum Fisher’s discriminant ratio (0.05 or below), and high nonlinearity 
of both nearest neighbor and linear classifiers (25% or above). For these cases a 
single tree performs poorly, and forest methods do not offer much help. Easier cases 
are those with relatively compact classes, i.e., when the pretopological measure 
is lower than 80%. For such cases improvements by forests over single trees are 
less significant. The geometrical measures also reveal comparative advantages of 
each type of forest. The subsampling method is preferable when the training set 
is very sparse relative to dimensionality, especially when coupled with a close-to- 
vanishing maximum Fisher’s ratio (0.3 or below), and when the class boundary 
is highly nonlinear. Subspace forests perform better when the class boundary is 
smoother (both nearest neighbor and linear classifiers display low nonlinearity). 
If the training set is large relative to dimensionality, the subspace method is 
preferable, even if the class distributions are long and thin. Rules similar 
to these may be observed in further studies. 

Given a problem in a fixed feature space, is there a limit on how well an 
automatically trainable classifier can do? Recall that all such methods are based 
on certain particular geometrical primitives, such as convex regions, axis parallel 
cuts, rectangular boxes, Gaussian kernels, piecewise linear boundaries, etc. It 
needs to be established that such models will fit into arbitrary shaped decision 
regions with arbitrary degree of connectedness. At which point should we say 
that it is meaningless to continue training, and that any more improvement in 
accuracy will be from luck rather than effort? VC dimension theories give us 
a certain limit, but for certain classes of problems, by exploiting the structural 
regularities and matching them to appropriate classifiers, we should be able to 
do better than that. Knowledge on the structural regularity may not be sufficient 
to remove entirely the probabilistic nature of the estimates inherent in unseen 
data, but should help in reducing the level of uncertainty. 



4 Conclusions 

I reviewed some challenges in the theories and practices of combining parallel 
classifiers. Many open questions point to a lack of insight into the intriguing 
interactions of the geometric and probabilistic characteristics of a problem and 
the classifier models. A thorough understanding of such interactions holds the 
key to further improvements of the methods. An essential need in this direction 
is to find a better set of descriptors for the geometrical structure of a problem 
in the feature space and to describe the behavior of classifiers in corresponding 
terms. These descriptors can be used to categorize the real world problems and 
their subproblems or transformations, which would permit studies and prediction 
of the behavior of various classifiers and combination methods on a whole class 
of cases. 

In addition, several methodological directions are also worth pursuing: 

— Ingenious designs of feature extractors and similarity measures that can sim- 
plify the class boundary will continue to play an important role in real appli- 
cations. Systematic searches with the same goal are even more interesting. 

— Unsupervised learning will play an increasing role in the context of 
supervised learning. Clustering methods will be applied more extensively and 
systematically to better understand the geometry of the class boundaries and 
its sensitivity to sampling density. Others such as estimation of intrinsic and 
extrinsic dimensionalities of the classes or subclasses will also be helpful. 

— More emphasis should be put on localized (or dynamically selected) classi- 
fication methods. A blind application of everything to everything will prove 
to be inferior to localized methods. Systematic strategies should be devel- 
oped to fine tune the classifiers to the characteristics of local regions and to 
associate them with corresponding input. 

— Better understanding is needed to choose between deterministic and stochas- 
tic classifier generating methods. This will need a careful study of the exact 
role of randomization in various classifier or combination tuning processes 
and the corresponding geometrical effects. 

— New methods can come from merging the decision optimization and cover- 
age optimization strategies, such that collections of specialized classifiers can 
be enhanced by introducing additional components with enforced comple- 
mentariness, and coverage optimization methods may use more sophisticated 
decision combination schemes. 

By now, classifier combination has become a rich and exciting area with 
much proven success. It is my hope that these discussions can call for attention 
to some of the confusions and missing links in the methodology, and point out 
the more fruitful directions for further research. Some recent developments are 
already moving towards these directions. This is very encouraging. 



Abstract. Genetic programming (GP) can automatically fuse 
given classifiers of diverse types to produce a combined classifier 
whose Receiver Operating Characteristics (ROC) are better than 
[Scott et al.1998b]’s “Maximum Realisable Receiver Operating 
Characteristics” (MRROC), i.e. better than their convex hull. This is 
demonstrated on a satellite image processing benchmark using Naive 
Bayes, Decision Trees (C4.5) and Clementine artificial neural networks. 



1 Introduction 



[Scott et al.1998b] has previously suggested the “Maximum Realisable Receiver 
Operating Characteristics” for a combination of classifiers is the convex hull of 
their individual ROCs. However the convex hull is not always the best that can be 
achieved [Yusoff et al.1998]. Previously we showed, in earlier work including 
[Langdon and Buxton2001b], that in at least some cases better ROCs can be 
automatically produced. We extend [Langdon and Buxton2001b] to show, on the 
problems derived from those proposed by [Scott et al.1998b], that genetic 
programming can automatically fuse different classifiers trained on different data 
to yield a classifier whose ROC are better than the convex hull of the supplied 
classifiers’ ROCs. 
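
The convex hull of a set of classifier operating points can be computed with a standard upper-hull sweep. The sketch below is an illustration only, not the procedure used in the cited work: it assumes each classifier contributes (false positive rate, true positive rate) points, and it always includes the two trivial operating points (0, 0) and (1, 1). The four example points are hypothetical.

```python
def roc_convex_hull(points):
    """Upper convex hull of ROC points, including (0, 0) and (1, 1).

    Each point is (false_positive_rate, true_positive_rate).
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Drop the last hull point while it lies on or below the line
        # from its predecessor to p (it cannot be on the upper hull).
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            cross = (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)
            if cross >= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

operating_points = [(0.1, 0.6), (0.3, 0.7), (0.4, 0.9), (0.6, 0.7)]
hull = roc_convex_hull(operating_points)
```

Points that fall under the hull, such as (0.3, 0.7) and (0.6, 0.7) here, are dominated by mixtures of the classifiers at the neighboring hull vertices.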

Section 2 gives the background to data fusion, followed by a summary of 
Scott’s work. The three classifiers are then described, followed by the 
satellite data. The genetic programming system and its results are given 
next. Finally we finish with a discussion and conclusions. Much of the 
earlier material is similar to [Langdon and Buxton2001b]; however the 
experimental work extends [Langdon and Buxton2001b] to consider fusing 
classifiers of very different types. 



2 Background 

There is considerable interest in automatic means of making large volumes of 
data intelligible to people. Arguably traditional sciences such as Astronomy, Bi- 
ology and Chemistry and branches of Industry and Commerce can now generate 
data so cheaply that it far outstrips human resources to make sense of it. 
Increasingly scientists and Industry are turning to their computers not only to 
generate data but to try and make sense of it. Indeed the new science of Bioin- 
formatics has arisen from the need for computer scientists and biologists to work 
together on tough, data rich problems, such as rendering protein sequence data 
useful. Of particular interest are the Pharmaceutical (drug discovery) and food 
preparation industries. 

The terms Data Mining and Knowledge Discovery are commonly used for 
the problem of getting information out of data. There are two common aims: 
1) to produce a summary of all or an interesting part of the available data 2) to 
find interesting subsets of the data buried within it. Of course these may over- 
lap. In addition to traditional techniques, a large range of “intelligent” or “soft 
computing” techniques, such as artificial neural networks, decision tables, fuzzy 
logic, radial basis functions, inductive logic programming, support vector ma- 
chines, are being increasingly used. Many of these techniques have been used in 
connection with evolutionary computation techniques such as genetic algorithms 
and genetic programming [Langdon1998]. 

We investigate ways of combining these and other classifiers with a view to 
producing one classifier which is better than each. Firstly we need to decide 
how we will measure the performance of a classifier. In practice, when using
any classifier a balance has to be chosen between missing positive examples and
generating too many spurious alarms. Such a balancing act is not easy, especially
in the medical field, where failing to detect a disease, such as cancer, has obvious
consequences but raising false alarms (false positives) also has implications for
patient well being. Receiver Operating Characteristics (ROC) curves allow us to
show graphically the trade off each classifier makes between its “false positive 
rate” (false alarms) and its “true positive rate” [Swets et al. 2000]. (The true
positive rate is the fraction of all positive cases correctly classified, while the false
positive rate is the fraction of negative cases incorrectly classified as positive.)
ROC curves are shown in Figs. 3 and 4. We treat each classifier as though it has
a sensitivity parameter (e.g. a threshold) which allows the classifier to be tuned.
At the lowest sensitivity level the classifier produces no false alarms but detects 
no positive cases, i.e. the origin of the ROC. As the sensitivity is increased, 
the classifier detects more positive examples but may also start generating false 
alarms (false positives). Eventually the sensitivity may become so high that 
the classifier always claims each case is positive. This corresponds to both true 
positive and false positive rates being unity, i.e. the top right hand corner of the 
ROC. On average a classifier which simply makes random guesses will have an
operating point somewhere on the line between the origin and (1,1) (cf. Fig. 1).
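
As an illustration of these definitions, the following Python sketch computes
the false positive and true positive rates of a scored classifier at one
threshold and sweeps the threshold to trace out an ROC. The function names, the
example scores and the convention that a score at or above the threshold counts
as positive are assumptions made for this sketch, not part of the experiments.

```python
def roc_point(scores, labels, threshold):
    """Return (false positive rate, true positive rate) at one threshold.

    Assumption: a case is called positive when its score >= threshold,
    and labels are 1 (positive) or 0 (negative).
    """
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return fp / negatives, tp / positives


def roc_curve(scores, labels, thresholds):
    """Sweep the threshold from strict to lenient to trace out the ROC."""
    return [roc_point(scores, labels, t) for t in thresholds]


# Hypothetical classifier outputs and true classes.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_curve(scores, labels, [1.1, 0.7, 0.5, 0.35, 0.0])
```

With the strictest threshold nothing is flagged, which gives the origin; with
the most lenient one everything is flagged, which gives the point (1,1).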

Naturally we want our classifiers to have ROC curves that come as close as
possible to a true positive rate of one and simultaneously a false positive rate
of zero. In Sect. 7 we score each classifier by the area under its ROC curve. An
ideal classifier has an area of one. We also require the given classifiers not
only to indicate which class they think a data point belongs to, but also how
confident they are of this. Values near zero indicate the classifier is not sure,
possibly because the data point lies near the classifier’s decision boundary.

Arguably the well known “boosting” techniques combine classifiers to get a
better one. However boosting is normally applied to only one classifier and
produces improvements by iteratively retraining it. Here we will assume the
classifiers we have are fixed, i.e. we do not wish to retrain them. Similarly
boosting is normally applied by assuming the classifier is operated at a single
sensitivity (e.g. a single threshold value). This means on each retraining it
produces a single pair of false positive and true positive rates, which is a
single point on the ROC rather than the curve we require.



3 “Maximum Realisable” ROC 

Scott’s Parcel system [Scott et al. 1998b] followed on from work on using
wrappers for feature subset selection [Kohavi and John 1997] and the use of ROC
hulls [Provost and Fawcett 2001]. However [Scott et al. 1998b] describe a method
to create, from two existing classifiers, a new one whose ROC lies on a line
connecting the ROC of its two components. This is done by choosing one or other
of the component classifiers at random and using its result. E.g. if we need a
classifier whose false positive rate vs. its true positive rate lies half way
between the ROC points of classifiers A and B, then Scott’s composite
classifier will randomly give the answer given by A half the time and that
given by B the other half, see Fig. 1. (Of course persuading patients to accept
such a random diagnosis may not be straightforward.)

The performance of the composite can be readily set to any point along the 
line simply by varying the ratio between the number of times one classifier is 
used relative to the other. Indeed this can be readily extended to any number of 
classifiers to fill the space between them. The better classifiers are those closer 
to the zero false positive axis or with a higher true positive rate. In other words 
the classifiers lying on the convex hull. 
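
The random combination can be sketched in a few lines of Python. This is only
an illustration of the idea described above: the function names, the operating
points of A and B and the use of Python's `random` module are assumptions made
here, not details taken from [Scott et al. 1998b].

```python
import random


def composite_point(point_a, point_b, weight_a):
    """Expected (FPR, TPR) when A is used with probability weight_a, else B."""
    fpr = weight_a * point_a[0] + (1 - weight_a) * point_b[0]
    tpr = weight_a * point_a[1] + (1 - weight_a) * point_b[1]
    return fpr, tpr


def composite_classify(classify_a, classify_b, weight_a, case, rng=random):
    """Answer with A's output with probability weight_a, otherwise B's."""
    chosen = classify_a if rng.random() < weight_a else classify_b
    return chosen(case)


a = (0.1, 0.6)  # hypothetical (false positive rate, true positive rate) of A
b = (0.4, 0.9)  # hypothetical operating point of B
c = composite_point(a, b, 0.5)  # half way between A and B
```

Varying `weight_a` between 0 and 1 moves the composite along the straight line
joining the two operating points.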




[Plot: ROC space; horizontal axis False Positives, from 0 to 1.]



Fig. 1. Classifier C is created by choosing equally between the output of classifier A and 
classifier B. Any point in the shaded area can be created. The “Maximum Realisable 
ROC” is its convex hull (solid line).

Often classifiers have some variable threshold or tuning parameter whereby
their trade off between false positives and true positives can be adjusted. This
means their Receiver Operating Characteristics (ROC) are now a curve rather
than a single point. We can apply Scott’s random combination method to any
set of points along the curve. So a “maximum realisable” ROC is the convex
hull of the (single) classifier’s ROC. Indeed, if the ROC curve is not convex,
an improved classifier can easily be created from it [Scott et al. 1998b]. The
nice thing about the MRROC is that it is always possible. But as we show it may
be possible to do better.
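
A minimal Python sketch of the MRROC construction is given below. It assumes
ROC points are (false positive rate, true positive rate) pairs and uses a
standard monotone chain scan to find the upper convex hull; the function names
and the example points are hypothetical.

```python
def _cross(o, a, b):
    """Cross product of vectors o->a and o->b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def upper_hull(points):
    """Upper convex hull of ROC points, always including (0,0) and (1,1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Drop the last hull point while it lies on or below the line
        # joining its predecessor to the new point p.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


def area_under(hull):
    """Trapezoidal area under a piecewise linear ROC."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(hull, hull[1:]))


# Hypothetical ROC points; (0.3, 0.65) lies below the hull and is dropped.
hull = upper_hull([(0.1, 0.6), (0.3, 0.65), (0.4, 0.9)])
area = area_under(hull)
```

Any point dropped by the scan can be matched or bettered by randomly mixing
the two hull points either side of it.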

4 Classifiers 

4.1 C4.5 

C4.5 [Quinlan 1993], like the other classifiers, was extended to allow its use within
our GP system. Each classifier takes a threshold parameter. To produce an ROC
curve the threshold is varied from zero to one.

To use a classifier in GP we adopt the convention that non-negative values 
indicate the data is in the class. We also require the classifier to indicate its 
“confidence” in its answer. In our GP, it does this by the magnitude of the value 
it returns. 

C4.5 was run with default settings to produce pruned trees containing
“confidence” values Z0 and Z1. Normally the decision tree’s final classification
would depend on which of Z0 and Z1 was the bigger. When the threshold is 0.5,
this is what GP returns. However if it is near 0, GP is more likely to return
class 0, while if the threshold is near 1, GP is biased towards class 1. (In
detail GP returns class 0 iff (1 − threshold) Z0 > threshold Z1.) This
determines the sign of the value returned to the GP system. The magnitude is
the C4.5 “confidence”, which is |Z0 − Z1|.
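
The rule above can be written as a small Python function. This is a sketch
only: the function name is hypothetical, and it follows the convention stated
earlier that a negative value means class 0 and a non-negative value means
class 1. Note that when Z0 equals Z1 the magnitude is zero, so the sign cannot
carry the class in that case.

```python
def c45_for_gp(z0, z1, threshold):
    """Signed value handed to GP: sign gives the class, magnitude the confidence."""
    confidence = abs(z0 - z1)
    predicts_class0 = (1 - threshold) * z0 > threshold * z1
    return -confidence if predicts_class0 else confidence


unbiased = c45_for_gp(0.8, 0.2, 0.5)      # Z0 bigger: class 0
biased_to_1 = c45_for_gp(0.6, 0.4, 0.9)   # high threshold pushes towards class 1
biased_to_0 = c45_for_gp(0.3, 0.7, 0.1)   # low threshold pushes towards class 0
```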

4.2 Naive Bayes Classifiers 

The Bayes [Ripley 1996; Mitchell 1997] approach attempts to estimate, from the
training data, the probability of data being in each class. Its prediction is the 
class with the highest estimated probability. We extend it 1) to include a tuning 
parameter to bias its choice of class and 2) to make it return a confidence based 
upon the difference between the two probabilities. 

If there is no training data for a given class/attribute value combination,
we follow [Kohavi and Sommerfield 1996, page 11] and estimate the probability
based on assuming there was actually a count of 0.5. ([Mitchell 1997] suggests a
slightly different way of calculating the estimates.)

A threshold $T$ ($0 < T < 1$) allows us to introduce a bias. That is, if
$(1 - T) \times P_{0,a}(E) < T \times P_{1,a}(E)$ then our Bayes classifier predicts $E$ is
in class 1, otherwise 0. ($P_{c,a}(E)$ is the probability estimated from the training
data, using attributes from the set $a$, that $E$ is in class $c$.) Finally we define the
classifier’s “confidence” to be $|P_{0,a}(E) - P_{1,a}(E)|$.

4.3 Artificial Neural Networks

We used the Clementine data mining tool to train an artificial neural network 
to model the training data. This model was then frozen and made available to 
genetic programming as a function with one argument. 

The ANN model was trained using Clementine version 5.0 on 2956 training 
records. Each record had nine integer inputs (from the last of the four spectral 
bands, see next section) and an integer range output. The output was 0 or 1 de- 
pending on whether the pixel was “grey” or not (see next section). The defaults 
were used, i.e. quick training, prevention of over training (50%), sensitivity anal- 
ysis and default stopping criterion for training. The model has one hidden layer 
of four nodes, whose performance Clementine estimates to be 72.32%. (Perfor- 
mance of ANN models for the first band to third bands were estimated at 83.78%, 
71.36% and 65.90%). 

The neural net model gives a continuous valued output. Values below 0.5
indicate class 0. For use in GP, we subtract 1.0 and add the threshold parameter.
This means values below zero indicate class 0, while non-negative values indicate
class 1. As usual the continuous threshold parameter allows us to tune the neural
network to trade off false positives against true positives and so obtain a
complete ROC curve rather than a single error rate. (A threshold of 0.5 indicates
no bias, i.e. use the raw neural network.) Notice that the “confidence” the GP
sees is directly related to how far from the neural network’s idle value (0.5)
its output is.

5 Grey Landsat 

The Landsat data comes from the Statlog project via the UCI machine learning
repository¹. The data is split into training (sat.trn, 4435 records) and test
(sat.tst, 2000) sets. Each record has 36 continuous attributes (8 bit integer
values nominally in the range 0-255) and a 6 way classification (classes 1, 2,
3, 4, 5 and 7). Following [Scott et al. 1998b], classes 3, 4 and 7 were combined
into one (positive, grey) while 1, 2 and 5 became the negative examples
(not-grey). sat.tst was kept for the holdout set.

The 36 data values represent intensity values for nine neighbouring pixels
and four spectral bands (see Fig. 2), while the classification refers to just the
central pixel. Since each pixel has eight neighbours and each may be in the
dataset, data values appear multiple times in the data set. But when they do,
they are presented as being different attributes each time. The data all come
from a rectangular area approximately five miles wide. Each of the three types
of classifiers is trained on data from one spectral band (Naive Bayes: first band,
C4.5: second band, artificial neural network: last band).

After reducing to two classes, the continuous values in sat.trn were
partitioned into bins before they were used by the Naive Bayes classifier.
Following [Scott et al. 1998a, page 8], we used entropy based discretisation
[Kohavi and Sommerfield 1996], implemented in MLC++ discretize.exe², with
default parameters (giving between 4 and 7 bins per attribute). To avoid
introducing bias, the holdout data (sat.tst) was partitioned using the same
bin boundaries. sat.trn was randomly split into training (2956 records) and
verification (1479) sets.

¹ ftp://ftp.ics.uci.edu/pub/machine-learning-databases/statlog/satimage

Fig. 2. Each record contains data from nine adjacent Landsat pixels. In these
experiments, all of the Naive Bayes classifiers are trained on the first spectral
band. There are two types of Naive Bayes classifiers, single attribute and those
trained on pairs of attributes. All nine single attribute and all pairs of
attributes 0, 4, 12, 16, 20 and 32 are available to GP. C4.5 was trained on nine
attributes 1, 5, ... 33 (second band) while the ANN was trained on the fourth
band (attributes 3, 7, ... 35).



6 Genetic Programming Configuration

The genetic programming system is almost identical to that described in
[Langdon and Buxton 2001b]. The GP is set up to signal its prediction of the
class of each data value in the same way as the classifiers, i.e. by returning a
floating point value, whose sign indicates the class and whose magnitude
indicates the “confidence”. (Note confidence is not constrained to a particular
range.)

Following earlier work [Jacobs et al. 1991; Soule 1999; Langdon 1998] each GP
individual is composed of five trees, each of which is capable of acting as a
classifier. The use of signed numbers makes it natural to combine classifiers by
adding them, i.e. the classification of the “ensemble” is the sum of the answers
given by the five trees. Should a single classifier be very confident about its
answer this allows it to “out vote” all the others.
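
The effect of summing signed confidences can be seen in a short Python sketch.
The function names and the example values are hypothetical, and the sketch
assumes the convention of Sect. 4.1 that non-negative values mean the positive
class.

```python
def ensemble_output(tree_outputs):
    """Combine the trees by adding their signed confidence values."""
    return sum(tree_outputs)


def ensemble_class(tree_outputs):
    """Class 1 if the summed value is non-negative, otherwise class 0."""
    return 1 if ensemble_output(tree_outputs) >= 0 else 0


# Four mildly negative trees are out voted by one very confident tree.
confident_minority = [-0.2, -0.1, -0.3, -0.1, 2.5]
# Here the fifth tree is not confident enough to overturn the others.
weak_minority = [-0.2, -0.1, -0.3, -0.1, 0.5]
```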



6.1 Function and Terminal Sets 

The function set includes the four binary floating point arithmetic operators
(+, ×, − and protected division), maximum and minimum and absolute maximum and
minimum. The latter two return the (signed) value of the largest (or smallest),
in absolute terms, of their inputs. IFLTE takes four arguments. If the first is
less than or equal to the second, IFLTE returns the value of its third argument.
Otherwise it returns the value of its fourth argument. INT returns the integer
part of its argument, while FRAC(e) returns e - INT(e).

² http://www.sgi.com/Technology/mlc
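
For concreteness, the less common primitives might be written in Python as
follows. This is a sketch under stated assumptions: the text does not say what
protected division returns for a zero divisor (1.0 is assumed here), nor how
INT treats negative arguments (truncation towards zero is assumed), and the
Python names are our own.

```python
import math


def protected_div(a, b):
    """Division that never raises; assumption: returns 1.0 when b is zero."""
    return a / b if b != 0 else 1.0


def max_abs(a, b):
    """Signed value of the input that is larger in absolute terms."""
    return a if abs(a) >= abs(b) else b


def min_abs(a, b):
    """Signed value of the input that is smaller in absolute terms."""
    return a if abs(a) <= abs(b) else b


def iflte(a, b, c, d):
    """Return c if a <= b, otherwise d."""
    return c if a <= b else d


def int_part(e):
    """Integer part of e (assumption: truncation towards zero)."""
    return float(math.trunc(e))


def frac(e):
    """FRAC(e) = e - INT(e)."""
    return e - int_part(e)
```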

The classifiers are represented as floating point functions. Their threshold is
supplied as their single argument, as described in Sect. 4.

The terminal T yields the current value of the threshold being applied to 
the classifier being evolved by GP. Finally the GP population was initially con- 
structed from a number of floating point values. These constants do not change 
as the population evolves. However crossover and mutation do change which 
constants are used and in which parts of the program. 



6.2 Fitness Function 

Each new individual is tested on each training example with the threshold
parameter (T) taking values from 0 to 1 every 0.1 (i.e. 11 values). So it is run
32516 times. For each threshold value the true positive rate is calculated (the
number of correct positive cases divided by the total number of positive cases).
If a floating point exception occurs its answer is assumed to be wrong. Similarly
its false positive rate is given by the number of negative cases it gets wrong
divided by the total number of negative cases. It is possible to do worse than
random guessing. When this happens, i.e. the true positive rate is less than the
false positive rate, the sign of the output is reversed. This is common practice
in classifiers.

Since a classifier can always achieve both a zero success rate and 100% false 
positive rate, the points (0,0) and (1,1) are always included. These plus the 
eleven true positive and false positive rates are plotted and the area under the 
convex hull is calculated. The area is the fitness of the individual GP program. 
Note the GP individual is not only rewarded for getting answers right but also 
for using the threshold parameter to get a range of high scores. Parameters are
summarised in Table 1.
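
The following Python sketch shows one way the thirteen ROC points behind the
fitness could be gathered. It is an illustration, not the system's code: the
function names and the toy program are hypothetical, the sign reversal is
assumed to be applied separately at each threshold, and any arithmetic
exception is counted as a wrong answer. The fitness would then be the area
under the convex hull of the returned points.

```python
def _rates(program, cases, threshold, sign):
    """(FPR, TPR) of sign * program output at one threshold."""
    tp = fp = pos = neg = 0
    for features, is_positive in cases:
        try:
            says_positive = sign * program(features, threshold) >= 0
        except ArithmeticError:
            says_positive = not is_positive  # exception: assumed wrong
        if is_positive:
            pos += 1
            tp += says_positive
        else:
            neg += 1
            fp += says_positive
    return fp / neg, tp / pos


def rates(program, cases, threshold):
    """ROC point at one threshold, reversing the sign if worse than random."""
    fpr, tpr = _rates(program, cases, threshold, +1)
    if tpr < fpr:
        fpr, tpr = _rates(program, cases, threshold, -1)
    return fpr, tpr


def roc_points(program, cases):
    """(0,0), (1,1) and the points for thresholds 0, 0.1, ..., 1."""
    return [(0.0, 0.0), (1.0, 1.0)] + [
        rates(program, cases, t / 10) for t in range(11)
    ]


# Hypothetical training cases: (single feature, is the case positive?).
cases = [(0.9, True), (0.6, True), (0.4, False), (0.2, False)]
points = roc_points(lambda x, t: x + t - 1.0, cases)
```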



7 Results 

The three types of classifier (C4.5, Naive Bayes and ANN) were made available
to GP, singly, in pairs and finally all three together, i.e. seven experiments
were run. (The 21 Naive Bayes classifiers are treated as a group, i.e. they are
either all included or all excluded.) In each run, GP’s answer was chosen as the
first occurrence of a program with the largest ROC area (on the training data)
found in the whole run. The ROC of these seven programs (on the holdout data)
are plotted in Fig. 4 and tabulated in Table 2. In all seven cases GP
automatically produced a classifier with better performance than those it was
given. That is, genetic programming fused classifiers of different types, trained
on different data, to yield superior classifiers.

8 Over Fitting

We have taken some care to ensure our input classifiers do not over fit the training 
data. Similarly one needs to be careful when using GP to avoid over fitting. So far 
we have seen little evidence of over fitting. This may be related to the problems 
themselves or the choice of multiple tree programs or the absence of “bloat”. The
absence of bloat may be due to our choice of size fair crossover [Langdon 2000]
and a high mutation rate. Our intention is to evaluate this GP approach on
more sophisticated classifiers and on harder problems. Here we expect it will be
essential to ensure the classifiers GP uses do not over fit. However, this may
not be enough to ensure the GP does not.

Table 1. Grey Landsat GP Parameters

Objective:    Evolve a function with Maximum Convex Hull Area
Function set: INT FRAC Max Min MaxA MinA MUL ADD DIV SUB IFLTE
              C4.5 ANN (nb0 nb4 nb8 nb12 nb16 nb20 nb24 nb28 nb32 nb0,4 nb0,12
              nb0,16 nb0,20 nb0,32 nb4,12 nb4,16 nb4,20 nb4,32 nb12,16 nb12,20
              nb12,32 nb16,20 nb16,32 nb20,32)
Terminal set: T 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Fitness:      Area under convex hull of 11 ROC points on 2956 test points
Selection:    generational (non elitist), tournament size 7
Wrapper:      > 0 => positive, negative otherwise
Pop Size:     500
              No size or depth limits
Initial pop:  ramped half-and-half (2:6) (half terminals are constants)
Parameters:   50% size fair crossover [Langdon 2000], 50% mutation (point 22.5%,
              constants 22.5%, shrink 2.5%, subtree 2.5%)
Termination:  generation 50




[Plot: ROC; horizontal axis False Positives.]



Fig. 3. The ROC produced by GP (generation 50) using threshold values 
0,0. 0 on the Thyroid data. Details of the experiment are reported in 
[Langdon and Buxton 2001b].

Fig. 4. The ROC produced by GP using seven combinations of classifiers on the Grey
Landsat holdout data. The caption gives the area under the ROC (holdout) and the
first generation to give the maximum area on the training data. (For simplicity only
the convex hull of each classifier is plotted.)



Table 2. Grey Landsat, Area under ROC on holdout set

Given Classifiers      Genetic Programming
ANN     0.764945
C4.5    0.74271

9 Conclusions



Previously [Langdon and Buxton 2001] we showed, using Scott’s own bench
marks, that genetic programming can do better than [Scott et al. 1998b]’s
MRROC [Langdon and Buxton 2001b]. Here we have shown GP can deal not only
with different classifiers but with classifiers of different types, trained on
different data. Genetic programming offers an automatic means of data fusion by
evolving combined classifiers.



Abstract. In the field of pattern recognition, multiple classifier systems based
on the combination of outputs of a set of different classifiers have been 
proposed as a method for the development of high performance classification 
systems. In this paper, the problem of design of multiple classifier system is 
discussed. Six design methods based on the so-called “overproduce and choose” 
paradigm are described and compared by experiments. Although these design 
methods exhibited some interesting features, they do not guarantee to design the 
optimal multiple classifier system for the classification task at hand. 
Accordingly, the main conclusion of this paper is that the problem of the 
optimal MCS design still remains open. 



1. Introduction 

In the last decade, quite a lot of papers proposed the combination of multiple 
classifiers for designing high performance pattern classification systems [1, 2]. The 
rationale behind the growing interest in multiple classifier systems (MCSs) is the 
acknowledgement that the classical approach to designing a pattern recognition 
system that focuses on finding the best individual classifier has some serious
drawbacks [3]. The main one is that it is very difficult to determine the best
individual classifier for the classification task at hand, except when deep prior
knowledge is available. In addition, the use of a single classifier does not allow the 
exploitation of the complementary discriminatory information that other classifiers 
may encapsulate. 

Roughly speaking, an MCS consists of an ensemble of different classification
algorithms and a decision function for combining classifier outputs. Therefore, the 
design of MCSs involves two main phases: the design of the classifier ensemble, and 
the design of the combination function. Although this formulation of the design 
problem leads one to think that effective design should address both the phases, most 
of the design methods described in the literature focus only on the former one. In 
particular, methods that focus on the design of the classifier ensemble have tended to 
assume a fixed, simple decision combination function and aim to generate a set of 
mutually complementary classifiers that can be combined to achieve optimal accuracy 
[2]. A common approach to the generation of such classifier ensembles is to use some 
form of data “sampling” technique, such that each classifier is trained on a
different subset of the training data [4]. Alternatively, methods focused on the
design of the combination function assume a given set of carefully designed
classifiers and aim to find an optimal combination of classifier decisions. In
order to perform such optimisation, a large set of combination functions of
arbitrary complexity is available to the designer, ranging from simple voting
rules through to “trainable” combination functions [2].

Methods for Designing Multiple Classifier Systems
J. Kittler and F. Roli (Eds.): MCS 2001, LNCS 2096, pp. 78-87, 2001.
© Springer-Verlag Berlin Heidelberg 2001

Although some design methods have proved to be very effective and some papers 
have investigated the comparative advantages of different methods [5], clear 
guidelines are not yet available for choosing the best design method for the 
classification task at hand. The designer of an MCS therefore has a toolbox containing 
quite a range of instruments for generating and combining classifiers. She/he may also 
design a myriad of different MCSs by coupling different techniques for creating 
classifier ensembles with different combination functions. However, the best MCS 
can only be determined by performance evaluation. Accordingly, some researchers 
proposed the so-called “overproduce and choose” paradigm (also called “test and 
select” approach [6]) in order to design the MCS most appropriate for the task at hand 
[7, 8, 9]. The basic idea is to produce an initial large set of “candidate” classifier 
ensembles, and then to select the sub-ensemble of classifiers that can be combined to 
achieve optimal accuracy. Typically, constraints and heuristic criteria are used in 
order to limit the computational complexity of the “choice” phase (e.g., the 
performances of a limited number of candidate ensembles are evaluated by a simple 
combination function like the majority voting rule [6, 7]). 

In this paper, six design methods based on the overproduce and choose paradigm
are described (Section 2): two methods proposed in [7], and four methods developed
by the authors (Sections 2.3 and 2.4). The measures of classifier “diversity” used for
MCS design are discussed in Section 2.2. The performances of such design methods 
were assessed and compared by experiment. Results are reported in Section 3. 
Conclusions are drawn in Section 4. 



2. Design Methods Based on the Overproduce and Choose 
Paradigm 

According to the overproduce and choose design paradigm, the MCS design cycle can be
subdivided into the following phases: 

1) Ensemble Overproduction 

2) Ensemble Choice 

The overproduction design phase aims to produce a large set of candidate
classifier ensembles. To this end, techniques like Bagging and Boosting that 
manipulate the training set can be adopted. Different classifiers can be also designed 
by using different initialisations of the respective learning parameters, using different 
classifier types and different classifier architectures. In practical applications, 
variations of the classifier parameters based on the designer expertise can provide 
very effective candidate classifiers [1,11]. 

The choice phase aims to select the subset of classifiers that can be combined
to achieve optimal accuracy. It is easy to see that such an optimal subset could
be obtained by exhaustive enumeration, that is, by assessing on a validation set
the classification accuracy provided by all possible subsets, and then choosing
the subset exhibiting the best performance. Such performance evaluation should be
performed with respect to a given combination function (e.g., the majority voting
rule). Unfortunately, if N is the size of the set produced by the overproduction
phase, the number of possible subsets is equal to
$\sum_{i=1}^{N} \binom{N}{i} = 2^N - 1$. Therefore, different strategies have
been proposed in order to limit the computational complexity of the choice phase.
Although the choice phase usually assumes a given combination function for
evaluating the performances of classifier ensembles, there is a strong interest in
techniques that allow choosing effective classifier ensembles without assuming a
specific combination rule. This can be seen via the analogy with the feature
selection problem, where techniques for choosing those features that are most
effective for preserving class separability have been developed. Accordingly,
techniques for evaluating the degree of error diversity of classifiers forming an
ensemble have been used for classifier selection purposes. We review some of
these techniques in Section 2.2.
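
To give a feel for why exhaustive enumeration does not scale, the short Python
sketch below enumerates every non-empty subset of a small candidate set and
picks the best one under a supplied evaluation function. The function names
are hypothetical, and the evaluation function is left to the caller (it could
be, for example, majority voting accuracy on a validation set).

```python
from itertools import combinations
from math import comb


def all_subsets(classifiers):
    """Yield every non-empty subset of the candidate classifiers."""
    for size in range(1, len(classifiers) + 1):
        yield from combinations(classifiers, size)


def exhaustive_choice(classifiers, evaluate):
    """Return the subset with the highest value of the evaluation function."""
    return max(all_subsets(classifiers), key=evaluate)


N = 4
count = sum(1 for _ in all_subsets(range(N)))
by_formula = sum(comb(N, i) for i in range(1, N + 1))
```

The count doubles (plus one) with each added classifier, so even a few dozen
candidates are out of reach for this approach.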

In the following, we shall assume that a large ensemble C made up of N classifiers
was created by the overproduction phase:

$C = \{c_1, c_2, \ldots, c_N\}$    (1)

The goal of the choice phase is to select the subset C* of classifiers that can be
combined to achieve optimal accuracy.



2.1 Methods Based on Heuristic Rules 

Partridge and Yates proposed some techniques that exploit heuristic rules for 
choosing classifier ensembles [7]. One technique can be named “choose the best”. It 
assumes an a priori fixed size n of the “optimal” subset C*. It then selects from the set
C the n classifiers with the highest classification accuracy to create the subset C*. The
rationale behind such heuristic choice is that all the classifier subsets exhibit similar 
degrees of error diversity. Accordingly, the choice is based only on the accuracy 
value. The other choice technique proposed by Partridge and Yates can be named 
“choose the best in the class”. For each classifier “class”, it chooses the classifier 
exhibiting the highest accuracy. Therefore, a subset C* made up of three classifiers 
will be created if the initial set C is made up of classifiers belonging to three classifier 
types (e.g., the multilayer perceptron neural net, the k-nearest neighbours classifier, 
and the radial basis functions neural net). With respect to the previous rule, this 
heuristic rule takes also into account that classifiers belonging to different types 
should be more error independent than classifiers of the same type. It should be noted 
that the use of heuristic rules allows us to strongly reduce the computational 
complexity of the choice phase, because the evaluation of different classifier subsets 
is not required. On the other hand, the general validity of such heuristics is obviously
not guaranteed.

2.2 Diversity Measures

As previously pointed out, several measures of error diversity for classifier ensembles 
have been proposed. Partridge and Yates proposed a measure named “within-set 
generalization diversity”, or simply GD, that is computed as follows [7]: 

$GD = 1 - \frac{p(\text{2 both fail})}{p(\text{1 fails})}$    (2)

where p(2 both fail) indicates the probability that two randomly selected classifiers
from the set C will both fail on a randomly selected input, and p(1 fails) indicates the
probability that one randomly selected classifier will fail on a randomly selected
input. GD takes values in the range [0, 1] and provides a measure of the diversity of
classifiers forming the ensemble.
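
A possible empirical estimate of GD is sketched below in Python. It assumes
that the two classifiers are drawn without replacement, so that on a pattern
where k of the N classifiers fail the chance that both selected classifiers
fail is k(k-1)/(N(N-1)); this reading, the function name and the toy failure
tables are assumptions of the sketch. It requires at least two classifiers and
at least one failure.

```python
def generalization_diversity(failures):
    """GD from a table where failures[i][j] is True if classifier i fails on pattern j."""
    n_classifiers = len(failures)
    n_patterns = len(failures[0])
    p_one = 0.0  # estimate of p(1 fails)
    p_two = 0.0  # estimate of p(2 both fail)
    for j in range(n_patterns):
        k = sum(failures[i][j] for i in range(n_classifiers))
        p_one += k / n_classifiers
        p_two += k * (k - 1) / (n_classifiers * (n_classifiers - 1))
    p_one /= n_patterns
    p_two /= n_patterns
    return 1.0 - p_two / p_one


# All three classifiers fail on the same pattern: no diversity.
identical = [[True, False, False, False]] * 3
# Each classifier fails on a different pattern: maximum diversity.
disjoint = [
    [True, False, False, False],
    [False, True, False, False],
    [False, False, True, False],
]
```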

Another diversity measure was proposed by Kuncheva et al. [10]. Let
$X = \{X_1, X_2, \ldots, X_M\}$ be a labelled data set. For each classifier $c_i$, we can
design an M-dimensional output vector $O_i = [O_{1,i}, \ldots, O_{M,i}]$, such that
$O_{j,i} = 1$ if $c_i$ classifies correctly the pattern $X_j$, and 0 otherwise. Q statistics
allow us to evaluate the diversity of two classifiers $c_i$ and $c_k$:

$Q_{i,k} = \frac{N^{11} N^{00} - N^{01} N^{10}}{N^{11} N^{00} + N^{01} N^{10}}$    (3)

where $N^{ab}$ is the number of elements $X_j$ of X for which $O_{j,i} = a$ and $O_{j,k} = b$
($M = N^{00} + N^{01} + N^{10} + N^{11}$). Q varies between -1 and 1. Classifiers that tend to
classify the same patterns correctly, that is, classifiers that are positively correlated,
will have positive values of Q. Classifiers that make errors on different patterns will
exhibit negative values of Q. For statistically independent classifiers, $Q_{i,k} = 0$. The
average Q computed over all the possible classifier couples is used for evaluating the
diversity of a classifier ensemble.
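
The Q statistic can be computed directly from the 0/1 output vectors, as in
the following Python sketch. The function names and the example vectors are
hypothetical, and the sketch makes no attempt to handle the case where the
denominator is zero.

```python
from itertools import combinations


def q_statistic(o_i, o_k):
    """Q statistic of two classifiers from their 0/1 correctness vectors."""
    n11 = sum(1 for a, b in zip(o_i, o_k) if a == 1 and b == 1)
    n00 = sum(1 for a, b in zip(o_i, o_k) if a == 0 and b == 0)
    n10 = sum(1 for a, b in zip(o_i, o_k) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(o_i, o_k) if a == 0 and b == 1)
    # Undefined (division by zero) when n11 * n00 + n01 * n10 == 0.
    return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)


def average_q(outputs):
    """Average Q over all couples of classifiers in the ensemble."""
    pairs = list(combinations(outputs, 2))
    return sum(q_statistic(a, b) for a, b in pairs) / len(pairs)


same = q_statistic([1, 1, 0, 0], [1, 1, 0, 0])         # identical behaviour
opposite = q_statistic([1, 1, 0, 0], [0, 0, 1, 1])     # errors on different patterns
unrelated = q_statistic([1, 1, 0, 0], [1, 0, 1, 0])    # independent behaviour
```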

Giacinto and Roli proposed a simple diversity measure, named “compound
diversity”, or simply CD, based on the compound error probability for two classifiers
$c_i$ and $c_j$ [8]:

$CD = 1 - \mathrm{prob}(c_i \text{ fails}, c_j \text{ fails})$    (4)

As for Q, the average CD computed over all the possible classifier couples is used
for evaluating the diversity of a classifier ensemble.

It should be noted that the GD and CD measures are based on similar concepts. 

As none of the above measures can be claimed to be the best, we used all of them 
in our design methods (Sections 2.3 and 2.4) and compared their performances 
(Section 3). 



2.3 Methods Based on Search Algorithms 

It is easy to see that search algorithms are the most natural way of implementing the 
choice phase required by the overproduce and choose design paradigm. Sharkey et al. 
proposed an exhaustive search algorithm based on the assumption that the number of 
candidate classifier ensembles is small [6]. In order to avoid the problem of 
exhaustive search, we developed three choice techniques based on search algorithms 
previously used for feature selection purposes (Sections 2.3.1 and 2.3.2), and for the
solution of complex optimisation tasks (Section 2.3.3). All these search algorithms
use an evaluation function for assessing the effectiveness of candidate ensembles. The
above diversity measures and the accuracy value assessed by the majority voting rule
have been used as evaluation functions. It should be remarked that the following
search algorithms avoid exhaustive enumeration, but the selection of the optimal
classifier ensemble is not guaranteed. It is worth noting that evaluation functions are
computed with respect to a validation set in order to avoid “overfitting” problems.
2.3.1 Forward Search 

The choice phase based on the forward search algorithm starts by creating an
ensemble made up of a single classifier (e.g., the classifier $c_1$). This initial classifier is
usually chosen randomly. Alternatively, the classifier with the highest accuracy can
be used. Then, single classifiers are added to $c_1$ to form subsets of two classifiers. If
the subset made up of $c_1$ and $c_2$ exhibits the highest value of the evaluation function,
one more classifier is added to such subset to form the subsets of three classifiers.
Such an iterative process stops when all the subsets of size k+1 exhibit values of the
evaluation function lower than the ones of size k. In this case, the subset of size k that
exhibited the highest value of the evaluation function is selected.
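
A minimal Python sketch of this forward search follows. It starts from the
classifier with the highest individual value (one of the two options mentioned
above) and stops as soon as the best enlarged subset scores lower than the
current one. The function names, the table `CORRECT` and the toy evaluation
function (patterns covered minus a size penalty) are hypothetical stand-ins
for a real validation set measure.

```python
# Hypothetical data: the validation patterns each classifier gets right.
CORRECT = {"a": {0, 1, 2}, "b": {2, 3}, "c": {0, 1}, "d": {4}}


def evaluate(subset):
    """Toy evaluation function: patterns covered minus a size penalty."""
    covered = set().union(*(CORRECT[c] for c in subset))
    return len(covered) - 0.6 * len(subset)


def forward_search(classifiers, evaluate):
    """Greedily add one classifier at a time while the value does not drop."""
    remaining = list(classifiers)
    first = max(remaining, key=lambda c: evaluate([c]))
    selected, best = [first], evaluate([first])
    remaining.remove(first)
    while remaining:
        candidate = max(remaining, key=lambda c: evaluate(selected + [c]))
        value = evaluate(selected + [candidate])
        if value < best:  # every subset of size k+1 is worse: stop
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = value
    return selected


chosen = forward_search(["a", "b", "c", "d"], evaluate)
```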

2.3.2 Backward Search 

In order to explain the developed search algorithms, let us use a simple example in 
which the set C created by the overproduction phase is made up of four classifiers. 
The backward search starts from the full classifier set. Then, eliminating one classifier 
from four, all possible subsets of three classifiers are created and their evaluation 
function values are assessed. If the subset made up of $c_1$, $c_2$, and $c_3$ exhibits the highest
value, then it is selected and the subsets of two classifiers are obtained from this set
by again eliminating one classifier. The iterative process stops when all the subsets of 
size k exhibit values of the evaluation function lower than the ones of size k+1. In 
such case, the subset of size k+1 that exhibited the highest value of the evaluation 
function is selected. 
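
The backward search can be sketched in the same way. Again, this is only an 
illustrative sketch: the function names, the four toy classifiers, and the stand-in 
evaluation function are assumptions introduced here, not part of the original method. 

```python
# Toy validation data (hypothetical): the validation samples each classifier gets right.
CORRECT = {
    "a": {0, 1, 2, 3, 4, 5},
    "b": {0, 1, 2, 6, 7},
    "c": {3, 4, 6, 7, 8},
    "d": {0, 1, 2, 3, 4},
}


def evaluate(subset):
    """Stand-in evaluation function (assumption): reward coverage, penalise size."""
    covered = set().union(*(CORRECT[c] for c in subset))
    return 2 * len(covered) - len(subset)


def backward_search(classifiers, evaluate):
    """Greedy backward search starting from the full classifier set."""
    current = frozenset(classifiers)
    current_score = evaluate(current)
    while len(current) > 1:
        # All subsets of size k obtained by eliminating one classifier.
        candidates = [current - {c} for c in sorted(current)]
        best = max(candidates, key=evaluate)
        best_score = evaluate(best)
        # Stop when every subset of size k is worse than the subset of size k+1.
        if best_score < current_score:
            break
        current, current_score = best, best_score
    return current, current_score


subset, score = backward_search(CORRECT, evaluate)
print(sorted(subset), score)
```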

2.3.3 Tabu Search 

The two previous algorithms stop the search process if the evaluation function 
decreases with respect to the previous step. As the evaluation function can exhibit a 
non-monotonic behaviour, it can be effective to continue the search process even if 
the evaluation function is decreasing. Tabu search is based on this concept. In 
addition, it implements both a forward and a backward search strategy. The search 
starts from the full classifier set. At each step, new subsets are created by adding or 
eliminating one classifier. Then, the subset that exhibits the highest value of the 
evaluation function is selected to create new subsets. It should be remarked that such 
a subset is selected even if the evaluation function decreases with respect to the 
previous step. In order to avoid the creation of the same subsets in different search 
steps (i.e., in order to avoid “cycles” in the search process), a classifier that has been 
added or eliminated cannot be selected for insertion/deletion for a certain number of 
search steps. Different stop criteria can be used. For example, the search can stop 
after a certain number of steps, and the best subset created during the search process 
is returned. 
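
A minimal sketch of such a tabu search is given below. The parameter names 
(max_steps, tenure), the toy data, and the stand-in evaluation function are assumptions 
made for illustration; the stop criterion used here is a fixed number of steps. 

```python
# Toy validation data (hypothetical): the validation samples each classifier gets right.
CORRECT = {
    "a": {0, 1, 2, 3, 4, 5},
    "b": {0, 1, 2, 6, 7},
    "c": {3, 4, 6, 7, 8},
    "d": {0, 1, 2, 3, 4},
}


def evaluate(subset):
    """Stand-in evaluation function (assumption): reward coverage, penalise size."""
    covered = set().union(*(CORRECT[c] for c in subset))
    return 2 * len(covered) - len(subset)


def tabu_search(classifiers, evaluate, max_steps=10, tenure=2):
    """Start from the full set; move to the best neighbour even if it is worse."""
    pool = sorted(classifiers)
    current = frozenset(pool)
    best, best_score = current, evaluate(current)
    tabu_until = {}  # classifier -> first step at which it may be moved again
    for step in range(max_steps):
        moves = []
        for c in pool:
            if tabu_until.get(c, 0) > step:
                continue  # recently added or eliminated: not selectable
            neighbour = current - {c} if c in current else current | {c}
            if neighbour:  # the empty ensemble is never evaluated
                moves.append((evaluate(neighbour), c, neighbour))
        if not moves:
            break
        score, moved, current = max(moves, key=lambda m: m[0])
        tabu_until[moved] = step + 1 + tenure
        if score > best_score:
            best, best_score = current, score
    return best, best_score


subset, score = tabu_search(CORRECT, evaluate)
print(sorted(subset), score)
```

The best subset seen so far is stored separately from the current one, because the 
current subset is allowed to get worse from one step to the next. 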

2.4 A Method Based on Clustering of Classifiers 

We developed an approach to the choice phase that allows the identification of an 
effective subset of classifiers with limited computational effort. It is based on the 
hypothesis that the set C created by the overproduction phase is made up of the union of 
M disjoint subsets C_i. In addition, we assumed that the compound error probability 
between any two classifiers belonging to the same subset is greater than the one 
between any two classifiers belonging to different subsets. It is easy to see that 
effective MCS members can be extracted from different subsets C_i, and the more so the 
more highly error-correlated the classifiers belonging to the same subset are, the 
classifiers belonging to different subsets being error-independent. Therefore, under 
the hypotheses above, we defined a choice phase made up of two steps, namely the 
identification of the subsets C_i by clustering of classifiers, and the extraction of 
classifiers from different clusters in order to create an effective classifier ensemble C*. 
Classifiers have been clustered according to the CD measure (eq. 4, section 2.2) so that 
classifiers that make a large number of coincident errors are assigned to the same 
cluster, and classifiers that make few coincident errors are assigned to different 
clusters. At each iteration of the clustering algorithm, one candidate ensemble C* is 
created by taking from each cluster the classifier that exhibits the maximum average 
distance from all other clusters. For each candidate ensemble C*, the classifiers are 
then combined by majority voting, and the ensemble with the highest performance is 
chosen. Further details on this design method can be found in [8,9]. 
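
The two steps can be illustrated with the following Python sketch. It is not the 
implementation described in [8,9]: the distance (one minus the fraction of validation 
samples misclassified by both classifiers), the average-linkage agglomerative 
clustering, the simplified majority-voting accuracy, and all names and toy data are 
assumptions made only for illustration. 

```python
from itertools import combinations

# Toy data (hypothetical): validation samples misclassified by each classifier.
ERRORS = {"a": {0, 1, 2}, "b": {0, 1, 3}, "c": {5, 6}, "d": {5, 6, 7}}
N_SAMPLES = 10


def majority_vote_accuracy(ensemble, errors, n_samples):
    """Simplified view: a sample is correct if fewer than half of the members err."""
    hits = sum(
        1 for s in range(n_samples)
        if 2 * sum(s in errors[c] for c in ensemble) < len(ensemble)
    )
    return hits / n_samples


def cluster_and_choose(errors, n_samples):
    names = sorted(errors)
    # Assumed distance: few coincident errors -> large distance.
    dist = {
        frozenset(p): 1.0 - len(errors[p[0]] & errors[p[1]]) / n_samples
        for p in combinations(names, 2)
    }

    def d(x, y):
        return dist[frozenset((x, y))]

    def cluster_dist(A, B):  # average linkage (assumption)
        return sum(d(x, y) for x in A for y in B) / (len(A) * len(B))

    clusters = [[n] for n in names]
    best, best_acc = None, -1.0
    while len(clusters) > 1:
        # Candidate ensemble: from each cluster, the classifier farthest
        # (on average) from the classifiers of all other clusters.
        ensemble = []
        for i, cl in enumerate(clusters):
            others = [x for j, o in enumerate(clusters) if j != i for x in o]
            ensemble.append(
                max(cl, key=lambda c: sum(d(c, x) for x in others) / len(others))
            )
        acc = majority_vote_accuracy(ensemble, errors, n_samples)
        if acc > best_acc:
            best, best_acc = sorted(ensemble), acc
        # Merge the two closest clusters and iterate.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return best, best_acc


ensemble, acc = cluster_and_choose(ERRORS, N_SAMPLES)
print(ensemble, acc)
```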



3. Experimental Results 

The Feltwell data set was used for our experiments. It consists of a set of multisensor 
remote-sensing images related to an agricultural area near the village of Feltwell 
(UK). Our experiments were carried out characterizing each pixel by a fifteen-element 
feature vector containing the brightness values in six optical bands and over nine 
radar channels. We selected 10944 pixels belonging to five agricultural classes and 
randomly subdivided them into a training set (5124 pixels), a validation set (528 
pixels), and a test set (5238 pixels). We used a small validation set in order to 
simulate real cases where validation data are difficult to obtain. A detailed 
description of this data set can be found in [11]. 

Our experiments mainly aimed to assess the performances of the proposed design 
methods (Sections 2.3 and 2.4) and to compare our methods with other design 
methods proposed in the literature (Section 2.1). 

To this end, we performed different overproduction phases, thus creating different 
initial ensembles C (see equation 1). Such sets were created using different classifier 
types, namely, Multilayer Perceptrons (MLPs), Radial Basis Function (RBF) neural 
networks, Probabilistic Neural Networks (PNNs), and the k-nearest neighbour 
classifier (k-nn). For each classifier type, ensembles were created by varying some 
design parameters (e.g., the network architecture, the initial random weights, the 
value of the “k” parameter for the k-nn classifier, and so on). In the following, we 
report the results relating to three initial sets C, here referred to as sets C^1, C^2, and 
C^3, generated by the following overproduction phases: 

- set C^1 contains fifty MLPs. Five architectures with one or two hidden layers and 
different numbers of neurons per layer were used. For each architecture, ten 
training phases with different initial weights were performed. All the networks had 
fifteen input units and five output units, corresponding to the input features and 
the data classes, respectively; 

- set C^2 contains the same MLPs belonging to C^1 and fourteen k-nn classifiers. The 
k-nn classifiers were obtained by varying the value of the “k” parameter in the 
following two ranges: (15, 17, 19, 21, 23, 25, 27) and (75, 77, 79, 81, 83, 85, 87); 

- set C^3 contains thirty MLPs, three k-nn classifiers, three RBF neural networks, and 
one PNN. For the RBF neural network, three different architectures were used. 



3.1 Experiments with Set C^1 

First of all, we evaluated the performances of the whole of set C^1, the best classifier in 
the ensemble, and those ensembles designed by the two methods based on heuristic 
rules (see Section 2.1). Such performances are reported in Table 1 in terms of 
percentage accuracy values, percentage rejection rates, and differences between 
accuracy and rejection values. The sizes of the selected ensembles are also shown. The 
classifiers were always combined by the majority-voting rule. A pattern was rejected 
when no majority of classifiers assigned it to the same data class. All 
values reported in Table 1 refer to the test set. For the method named “choose the 
best” (indicated with the term “Best” in Table 1), the performances of ensembles of 
size ranging from 3 through 15 were assessed. The size of the ensemble designed by 
the method named “choose the best in the class” (indicated with the term 
“Best-class”) is five, because five types of classifiers (namely, five types of net 
architectures) were used to create the ensemble C^1 (Section 2.1). For each ensemble, 
the value of the Generalisation Diversity measure (GD) is reported in order to show 
the degree of error diversity among the classifiers. 
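
For illustration, the majority-voting rule with rejection can be sketched as follows. 
The function names and the toy votes are hypothetical, and it is an assumption of 
this sketch that both accuracy and rejection are expressed as percentages of all test 
patterns. 

```python
from collections import Counter


def majority_vote(labels):
    """Return the class chosen by more than half of the classifiers, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if 2 * count > len(labels) else None  # None means "rejected"


def accuracy_and_rejection(votes_per_pattern, true_labels):
    """Percentage accuracy and percentage rejection over all patterns."""
    n = len(true_labels)
    decisions = [majority_vote(v) for v in votes_per_pattern]
    rejected = sum(d is None for d in decisions)
    correct = sum(d == t for d, t in zip(decisions, true_labels))
    return 100.0 * correct / n, 100.0 * rejected / n


# Three classifiers voting on four patterns (toy data).
votes = [[1, 1, 2], [1, 2, 3], [2, 2, 2], [0, 1, 1]]
truth = [1, 1, 2, 0]
print(accuracy_and_rejection(votes, truth))
```

In this toy example the second pattern is rejected because the three classifiers 
disagree, and the fourth is accepted but misclassified. 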

Table 1 shows that the design methods based on heuristic rules can provide some 
improvements with respect to the accuracy of the initial ensemble C^1 and the best 
classifier. However, such improvements are small. It should be noted that these design 
methods do not provide improvements in terms of error diversity as assessed by the 
GD measure. This can be explained by observing that such methods select classifiers 
on the basis of accuracy, and they do not explicitly take error diversity into account. 

Table 2 reports results obtained by our design methods based on search algorithms 
(Section 2.3). The classifiers were always combined by the majority-voting rule. A 
pattern was rejected when no majority of classifiers assigned it to the same data 
class. All values reported in Table 2 refer to the Feltwell test set. It 
should be noted that these design methods improve the error diversity, that is, the 
ensembles are characterised by GD values higher than the ones reported in Table 1. 
However, the improvements in accuracy with respect to the initial ensemble C^1 and 
the best classifier are similar to the ones provided by the methods based on heuristic 
rules. 

Table 1. Performances of the whole set C^1, the best classifier in the ensemble, and the 
ensembles designed by two methods based on heuristic rules. 

Ensemble          Size   Accuracy   Rejection   Accuracy - Rejection   GD 
Initial set C^1     50    89.8357      1.2027                88.6330   0.2948 
Best classifier      1    89.2516      0.0000                89.2516   N/A 
Best                 3    90.2565      0.2673                89.9892   0.2399 
Best                 5    90.4278      0.4773                89.9505   0.1937 
Best                 7    90.0134      0.4009                89.6125   0.1801 
Best                 9    90.0459      0.2673                89.7786   0.1783 
Best                11    90.0747      0.3627                89.7120   0.1935 
Best                13    89.9732      0.2291                89.7441   0.2008 
Best                15    89.9712      0.4391                89.5321   0.2063 
Best-class           5    89.9847      0.4964                89.4883   0.2617 

Table 2. Performances of the ensembles generated by design methods based on search 
algorithms. The evaluation function used to guide the search is indicated within brackets. 

Choice Method                 Size   Accuracy   Rejection   Acc. - Rej.   GD 
Initial set C^1                 50    89.8357      1.2027       88.6330   0.2948 
Best classifier                  1    89.2516      0.0000       89.2516   N/A 
Backward(GD)                     3    89.9981      1.6991       88.2990   0.4752 
Backward(CD)                     3    90.4890      0.8400       89.6490   0.3573 
Backward(Accuracy)              45    89.8517      0.8591       88.9926   0.2950 
Backward(Q)                      3    88.6901      0.7446       87.9455   0.4129 
Forward_from_best(GD)            3    89.5669      0.8209       88.7460   0.4402 
Forward_from_best(CD)            3    90.0499      0.6109       89.4390   0.3454 
Forward_from_best(Accuracy)     11    90.3965      0.8018       89.5947   0.3274 
Forward_from_best(Q)             3    88.2387      1.1455       87.0932   0.3942 
Forward_random(GD)               3    89.0993      1.2218       87.8775   0.3958 
Forward_random(CD)               7    90.2866      0.7446       89.5420   0.3346 
Forward_random(Accuracy)         7    90.2420      0.6109       89.6311   0.2589 
Forward_random(Q)                3    87.0609      0.5536       86.5073   0.3845 
Tabu(GD)                         3    89.9459      1.2600       88.6859   0.4613 
Tabu(CD)                         3    90.1425      0.8400       89.3025   0.3806 
Tabu(Accuracy)                   9    90.1156      0.9164       89.1992   0.3416 
Tabu(Q)                          3    89.8180      1.3746       88.4434   0.4826 



Table 3 reports results obtained by our design method based on clustering of 
classifiers (Section 2.4). Conclusions similar to those for the design methods based on 
search algorithms can be drawn. 

Table 3. Performances of the ensembles generated by the design method based on clustering 
of classifiers. The evaluation function used to guide the search is indicated within brackets. 

Choice Method     Size   Accuracy   Rejection   Acc. - Rej.   GD 
Initial set C^1     50    89.8357      1.2027       88.6330   0.2948 
Best classifier      1    89.2516      0.0000       89.2516   N/A 
Cluster(CD)          7    90.5294      0.8209       89.7085   0.3193 
Cluster(Q)          49    89.6592      0.8591       88.8001   0.2962 
Cluster(GD)          9    89.6179      1.0691       88.5488   0.3788 



It is worth noting that the performances of the various design methods are slightly 
better than the ones of the initial ensemble C^1 and the best classifier, but the differences 
are small. However, it should be noted that methods based on search algorithms and 
clustering of classifiers improve the error diversity of classifiers. 



3.2 Experiments with Sets C^2 and C^3 

The same experiments previously described for set C^1 were performed for sets C^2 and 
C^3. For the sake of brevity, for each design method we report the average 
performances in terms of accuracy and error diversity values. Table 4 shows the 
average accuracy values and the average error diversity values (in terms of the GD 
measure) of different design methods applied to sets C^2 and C^3. 

Table 4. For each design method, the average percentage accuracy value and the average error 
diversity value (in terms of the GD measure) are reported for the experiment with the set C^2 
and for the experiment with the set C^3. 

                                    Set C^2               Set C^3 
Choice Method                  Accuracy   GD         Accuracy   GD 
Initial set                     90.4918   0.3170      89.4645   0.3819 
Best classifier                 90.0916   -           88.2016   - 
Choose the best                 90.1090   0.1989      91.1097   0.3279 
Choose the best in the class    89.9847   0.2617      92.0613   0.4905 
Backward                        89.8945   0.3488      92.3871   0.5851 
Forward from the best           89.9024   0.3400      93.2471   0.5917 
Forward from random             89.7408   0.3270      91.5023   0.5969 
Tabu                            90.0931   0.3631      93.5092   0.6225 
Clustering                      89.9013   0.3410      92.1911   0.5383 



With regard to the experiments performed on sets C^2 and C^3, the performances of 
the various design methods are close to or better than those of the initial ensembles and 
the best classifier. Significant improvements were obtained for some of the 
experiments performed on set C^3. 

4. Discussion and Conclusions 

Although definitive conclusions cannot be drawn on the basis of the limited set of 
experiments above, some preliminary conclusions can be drawn. The overproduce 
and choose paradigm does not guarantee optimal MCS design for the classification task 
at hand. In particular, no choice method can be claimed to be the best, because the 
superiority of one over the other depends on the classification task at hand. 
Accordingly, optimal MCS design is still an open issue. The main motivation behind 
the use of the overproduce and choose paradigm is that at present clear guidelines to 
choose the best design method for the classification task at hand are lacking. Thanks 
to this design paradigm it is possible to exploit the large set of tools developed to 
generate and combine classifiers. The designer can create a myriad of different MCSs 
by coupling different techniques to create classifier ensembles with different 
combination functions. Then, the most appropriate MCS can be selected by 
performance evaluation. It is worth noting that this approach is commonly used in 
engineering fields where optimal design methods are not available (e.g., software 
engineering). In addition, the overproduce and choose paradigm allows the creation of 
MCSs made up of small sets of classifiers. This is a very important feature for 
practical applications. 

Abstract. A scheme is proposed for classifier combination at decision 
level which stresses the importance of classifier selection during 
combination. The proposed scheme is optimal (in the Neyman-Pearson sense) 
when sufficient data are available to obtain reasonable estimates of the 
joint densities of classifier outputs. Four different fingerprint matching 
algorithms are combined using the proposed scheme to improve the 
accuracy of a fingerprint verification system. Experiments conducted on a 
large fingerprint database (~2,700 fingerprints) confirm the effectiveness 
of the proposed integration scheme. An overall matching performance 
increase of ~3% is achieved. We further show that a combination of 
multiple impressions or multiple fingers improves the verification 
performance by more than 4% and 5%, respectively. Analysis of the results 
provides some insight into the various decision-level classifier combination 
strategies. 



1 Introduction 

A number of classifier combination strategies exist Q, H, however, a priori 
it is not known which combination strategy works better than the others and if 
so under what circumstances. In this paper we will restrict ourselves to a partic- 
ular decision-level integration scenario where each classifier may select its own 
representation scheme and produces a confidence value as its output. A theoret- 
ical framework for combining classifiers in such a scenario has been developed 
by Kittler et al. [5]. However, the product rule for combination suggested in [5] 
implicitly assumes an independence of classifiers. The sum rule further assumes 
that the a posteriori probabilities computed by the respective classifiers do not 
deviate dramatically from the prior probabilities. The max rule, min rule, median 
rule, and majority vote rule have been shown to be special cases of the sum and 
the product rules. Making these assumptions simplifies the combination rule but 
does not guarantee optimal results and hinders the combination performance. 
We follow Kittler et al.’s framework without making any assumptions about the 
independence of various classifiers. 

The contributions of this paper are twofold. Firstly, we propose a general 
system design for decision-level classifier fusion that uses the optimal 
Neyman-Pearson rule and outperforms the combination strategies based on the 
assumption of independence among the classifiers. Secondly, we propose a multi-modal 
biometric system design based on multiple fingerprint matchers. The use of the 
proposed combination strategy in combining multiple matchers significantly 
improves the overall accuracy of the fingerprint-based verification system. The 
effectiveness of the proposed integration strategy is further demonstrated by 
building multi-modal biometric systems that combine two different impressions 
of the same finger or fingerprints of two different fingers. 

2 Biometrics 

The biometric verification problem can be formulated as follows. Let the stored 
biometric signal (template) of a person be represented as S and the acquired 
signal (input) for authentication be represented by I. Then the null and alternate 
hypotheses are: 

H_0 : I ≠ S, input fingerprint is not from the same finger as the template, 

H_1 : I = S, input fingerprint is from the same finger as the template. 

The associated decisions are as follows: D_0 : person is an imposter, and D_1 : 
person is genuine. The verification involves matching S and I using a similarity 
measure. If the matching score is less than some decision threshold T, then decide 
D_0, else decide D_1. Then, FAR = P(D_1|w_0), and FRR = P(D_0|w_1), where w_0 
is the class with H_0 = true and w_1 is the class with H_1 = true. 
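
The two error rates can be estimated empirically from sets of matching scores, as in 
the following sketch. The function name far_frr and the toy scores are hypothetical; 
the decision rule is the thresholding rule stated above. 

```python
def far_frr(imposter_scores, genuine_scores, threshold):
    """Empirical FAR and FRR when D_1 (genuine) is decided for score >= threshold."""
    # FAR: fraction of imposter matchings accepted as genuine.
    far = sum(s >= threshold for s in imposter_scores) / len(imposter_scores)
    # FRR: fraction of genuine matchings rejected as imposter.
    frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    return far, frr


imposter = [10, 20, 30, 60]  # toy matching scores
genuine = [40, 55, 70, 90]
print(far_frr(imposter, genuine, threshold=50))
```

Raising the threshold T lowers the FAR at the price of a higher FRR, which is the 
trade-off traced by an ROC curve. 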

Several biometric systems have been designed and tested on large databases. 
However, in some applications with stringent performance requirements, no single 
biometric can meet the requirements due to the inexact nature of the sensing, feature 
extraction, and matching processes. This has generated interest in designing 
multimodal biometric systems. Multimodal biometric systems may work in one 
of the following five scenarios: (i) Multiple sensors: the information obtained from 
different sensors for the same biometric may be combined. For example, optical, 
ultrasound, and capacitance based sensors are available to capture fingerprints. 
(ii) Multiple biometric systems: multiple biometrics such as fingerprint and face 
may be combined [2]. (iii) Multiple units of the same biometric: one image each 
from both irises, or both hands, or ten fingerprints may be combined. (iv) 
Multiple instances of the same biometric: for example, multiple impressions of 
the same finger may be combined. (v) Multiple representation and matching 
algorithms for the same input biometric signal: for example, combining different 
approaches to feature extraction and matching of fingerprints 0 . The first two 
scenarios require several sensors and are not cost effective. Scenario (iii) causes 
inconvenience to the user in providing multiple cues and has a longer acquisition 
time. In scenario (iv), only a single input is acquired during verification and 
matched with several stored templates acquired during the one-time enrollment 
process. Thus, it is slightly better than scenario (iii). In our opinion, scenario 
(v) is the most cost-effective way to improve biometric system performance. 

We propose to use a combination of four different fingerprint-based biometric 
systems where each system uses different feature extraction and/or matching 
algorithms to generate a matching score which can be interpreted as the confidence 
level of the matcher. These different matching scores are combined to obtain the 
lowest possible FRR for a given FAR. We also compare the performance of our 
integration strategy with the sum and the product rules [5]. Even though we 
propose and report results in scenarios (iii), (iv) and (v), our combination strategy 
could be used for scenarios (i) and (ii) as well. 

3 Optimal Integration Strategy 

Let us suppose that pattern Z is to be assigned to one of the two possible classes, 
w_0 and w_1. Let us assume that we have N classifiers, and the ith classifier outputs 
a single confidence value θ_i about class w_1 (the confidence for the class w_0 will be 
1 − θ_i), i = 1, 2, ..., N. Let us assume that the prior probabilities for the two classes 
are equal. The classifier combination task can now be posed as an independent 
(from the original N classifier designs) classifier design problem with two classes 
and N features (θ_i, i = 1, 2, ..., N). 

3.1 Classifier Selection 

It is a common practice in classifier combination to perform an extensive anal- 
ysis of various combination strategies involving all the N available classifiers. 
In feature selection it is well known that the most informative d-element subset 
of N conditionally independent features is not necessarily the union of the d 
individually most informative features. Cover |S| argues that no non-exhaustive 
sequential d-element selection procedure is optimal, even for jointly normal fea- 
tures. He further showed that all possible probability of error ordering can occur 
among subsets of features subject to a monotonicity constraint. The statistical 
dependence among features causes further uncertainty in the d-element subset 
composed of the individually best features. One could argue that the combination 
strategy itself should pick out the classifiers that should be combined. However, 
we know in practice that the “curse of dimensionality” makes it difficult for a 
classifier to automatically delete less discriminative features jOj. Therefore, we 
propose a classifier selection scheme prior to classifier combination. We propose 
to use the class separation statistic [2] as the feature effectiveness criterion. This 
statistic, CS, measures how well the two classes (imposter and genuine, in our 
case) are separated with respect to the feature vector, X^d, in a d-dimensional 
space, R^d. 

$$ CS(X^d) = \int_{R^d} \left| p(X^d | w_0) - p(X^d | w_1) \right| \, dx, \qquad (1) $$

where p(X^d|w_0) and p(X^d|w_1) are the estimated distributions for the w_0 
(imposter) and w_1 (genuine) classes, respectively. Note that 0 ≤ CS ≤ 2. 

We will use the class separation statistic to obtain the best feature subset 
using an exhaustive search of all possible 2^N − 1 feature subsets. 
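
As a rough illustration, the class separation statistic and the exhaustive subset 
search can be sketched as follows. This sketch replaces the estimated densities by 
joint histograms, which is an assumption made here for simplicity; the function 
names, the bin settings, and the toy scores are likewise hypothetical. 

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def class_separation(imposter, genuine, features, bins=10, lo=0, hi=100):
    """Histogram estimate of CS: sum over cells of |p(cell|w_0) - p(cell|w_1)|."""
    def cell(x):
        # Index of the histogram cell of x, restricted to the chosen features.
        return tuple(
            min(int((x[f] - lo) * bins / (hi - lo)), bins - 1) for f in features
        )

    p0 = Counter(cell(x) for x in imposter)
    p1 = Counter(cell(x) for x in genuine)
    return sum(
        abs(Fraction(p0[c], len(imposter)) - Fraction(p1[c], len(genuine)))
        for c in set(p0) | set(p1)
    )


def best_subset(imposter, genuine, n_features):
    """Exhaustive search over all 2^N - 1 non-empty feature subsets."""
    subsets = [
        s for d in range(1, n_features + 1)
        for s in combinations(range(n_features), d)
    ]
    # max() keeps the first maximiser, i.e. the smallest subset among ties.
    return max(subsets, key=lambda s: class_separation(imposter, genuine, s))


# Toy scores from two matchers: matcher 0 separates the classes, matcher 1 does not.
imposter = [(10, 50), (20, 60), (15, 50)]
genuine = [(80, 50), (90, 60), (85, 55)]
print(best_subset(imposter, genuine, n_features=2))
```

With histograms the estimate takes its extreme values 0 and 2 when the two class 
histograms coincide or do not overlap at all, respectively. 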

3.2 Non-parametric Density Estimation 

Once we have selected the subset containing d (d ≤ N) features, we develop our 
combination strategy. We do not make any assumptions about the form of the 
distributions for the two classes and use non-parametric methods to estimate 
the two distributions. We use the Parzen window density estimate to obtain the 
non-parametric distributions |Sj. A Gaussian kernel was used and the window 
width was empirically determined. 
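
A Parzen window estimate with a Gaussian kernel can be sketched as follows. The 
isotropic kernel with a single width h for all dimensions, the function name, and the 
toy samples are assumptions of this sketch; the window width actually used in the 
experiments is not reported here. 

```python
import math


def parzen_density(x, samples, h):
    """Parzen-window density estimate at point x with an isotropic Gaussian kernel."""
    d = len(x)
    norm = (2.0 * math.pi * h * h) ** (-d / 2.0)  # Gaussian normalising constant
    total = 0.0
    for s in samples:
        sq_dist = sum((xi - si) ** 2 for xi, si in zip(x, s))
        total += math.exp(-sq_dist / (2.0 * h * h))
    return norm * total / len(samples)


# Toy two-dimensional training scores for one class.
train = [(1.0, 2.0), (1.5, 1.5), (3.0, 3.5)]
print(parzen_density((1.2, 1.8), train, h=0.5))
```

A larger h gives a smoother estimate, while a smaller h follows the training samples 
more closely. 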

3.3 Decision Strategy 

We use the likelihood ratio L = P(X^d|w_0) / P(X^d|w_1) to make the final decision 
for our two-class problem: decide D_0 (person is an imposter) for high values of 
L; decide D_1 (person is genuine) for low values of L. If L is small, the data 
is more likely to come from class w_1; the likelihood ratio test rejects the null 
hypothesis for small values of the ratio. The Neyman-Pearson lemma states that 
this test is optimal, that is, among all the tests with a given significance level, α 
(FAR), the likelihood ratio test has the maximum power. For a specified α, λ 
is the smallest constant such that P(L < λ) < α. The type II error, β (FRR), 
is given by P(L > λ). 
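
The decision rule can be illustrated with the sketch below, which works directly on 
likelihood-ratio values. The way the threshold is picked here (an empirical quantile 
of the imposter ratios that keeps the empirical FAR at or below α) is an assumption of 
this sketch, as are the function names and the toy values. 

```python
def choose_threshold(imposter_ratios, alpha):
    """Threshold whose empirical FAR, P(L < lam | imposter), does not exceed alpha."""
    ratios = sorted(imposter_ratios)
    k = int(alpha * len(ratios))  # number of imposters allowed below the threshold
    return ratios[k] if k < len(ratios) else float("inf")


def decide(ratio, lam):
    """Small L rejects the null hypothesis (imposter), so the person is genuine."""
    return "genuine" if ratio < lam else "imposter"


def far_frr(imposter_ratios, genuine_ratios, lam):
    far = sum(r < lam for r in imposter_ratios) / len(imposter_ratios)
    frr = sum(r >= lam for r in genuine_ratios) / len(genuine_ratios)
    return far, frr


# Toy likelihood ratios L = p(x|w_0) / p(x|w_1).
imposter_L = [0.5, 2.0, 4.0, 8.0]
genuine_L = [0.1, 0.2, 0.4, 3.0]
lam = choose_threshold(imposter_L, alpha=0.25)
print(lam, far_frr(imposter_L, genuine_L, lam), decide(0.3, lam))
```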



4 Matching Algorithms 

We have developed four fingerprint verification systems which can be broadly 
classified into two categories: (i) minutiae-based, and (ii) filter-based. The three 
minutiae-based and one filter-based algorithms are summarized in this section. 

4.1 Matcher Hough 

The fingerprint matching problem can be regarded as template matching nm: 
given two sets of minutia features, compute their matching score. The main 
steps of the algorithm are: (1) Compute the transformation parameters δx, δy, 
θ, and s between the two images, where δx and δy are translations along the x- and 
y-directions, respectively, θ is the rotation angle, and s is the scaling factor; (2) 
Align the two sets of minutia points with the estimated parameters and count the 
matched pairs within a bounding box; (3) Repeat the previous two steps for the 
set of discretized allowed transformations. The transformation that results in 
the highest matching score is believed to be the correct one. The final matching 
score is scaled between 0 and 99. Details of the algorithm can be found in HD!. 

4.2 Matcher String 

Each set of extracted minutia features is first converted into polar coordinates 
with respect to an anchor point. The two-dimensional (2D) minutia features 
are, therefore, reduced to a one-dimensional (1D) string by concatenating points 
in increasing order of radial angle in polar coordinates. The string matching 
algorithm is applied to compute the edit distance between the two strings. The 
edit distance can be easily normalized and converted into a matching score. This 
algorithm 0 can be summarized as follows: (1) Rotation and translation are 



92 



S. Prabhakar and A.K. Jain 



estimated by matching the ridge segment (represented as a planar curve) associated 
with each minutia in the input image with the ridge segment associated with each 
minutia in the template image. The rotation and translation that result in the 
maximum number of matched minutiae pairs within a bounding box are considered 
the correct transformation and the corresponding minutiae are labeled as anchor 
minutiae, A_1 and A_2, respectively. (2) Convert each set of minutiae into a 1D 
string using polar coordinates anchored at A_1 and A_2, respectively; (3) Compute 
the edit distance between the two 1D strings. The matched pairs are retrieved 
based on the minimal edit distance between the two strings; (4) Output the 
normalized matching score, which is the ratio of the number of matched pairs 
and the number of minutiae points in the two sets. 

4.3 Matcher Dynamic 

This matching algorithm is a generalization of the above mentioned string 
algorithm. The transformation of a 2D pattern into a 1D pattern usually results in 
a loss of information. Chen and Jain El have shown that fingerprint matching 
using 2D dynamic time warping can be done as efficiently as 1D string editing 
while avoiding the above mentioned problems with algorithm String. The 2D 
dynamic time warping algorithm can be characterized by the following steps: (1) 
Estimate the rotation between the two sets of minutia features as in Step 1 of 
algorithm String; (2) Align the two minutia sets using the estimated parameters 
from Step 1; (3) Compute the maximal matched minutia pairs of the two minutia 
sets using 2D dynamic programming technique. The intuitive interpretation of 
this step is to warp one set of minutia to align with the other so that the number 
of matched minutiae is maximized; (4) Output the normalized matching score 
which is based on only those minutiae that lie within the overlapping region. 

4.4 Matcher Filter 

The four main steps in the filter-based feature extraction algorithm are: (i) 
determine a reference point and region of interest for the fingerprint image. The 
reference point is taken to be the center point in a fingerprint, which is defined 
as the point of maximum curvature of the ridges in a fingerprint. The region 
of interest is a circular area around the reference point. The algorithm rejects 
the fingerprint images for which the reference point could not be established. 
(ii) tessellate the region of interest. The region of interest is divided into sectors 
and the gray values in each sector are normalized to a constant mean and 
variance. (iii) filter the region of interest in eight different directions using a bank of 
Gabor filters (eight directions are required to completely capture the local ridge 
characteristics in a fingerprint while only four directions are required to capture 
the global configuration). Filtering produces a set of eight filtered images. (iv) 
compute the average absolute deviation from the mean (AAD) of gray values 
in individual sectors in each filtered image. The AAD value in each sector quantifies 
the underlying ridge structures and is defined as a feature. A feature vector, 
which we call FingerCode, is the collection of all the features (for every sector) 
in each filtered image. Thus, the feature elements capture the local information 
and the ordered enumeration of the tessellation captures the invariant global 
relationships among the local patterns. The representation is invariant to 
translation of the image. It is assumed that the fingerprint is captured in an upright 
position and the rotation invariance is achieved by storing ten representations 
corresponding to the various rotations (−45.0°, −45°, −33.75°, −22.5°, −11.25°, 
0°, 11.25°, 22.5°, 33.75°, 45.0°) of the image. The Euclidean distance is computed 
between the input representation and the ten templates to generate ten matching 
distances. Finally, the minimum of the ten distances is computed and inverted 
to give a matching score. The matching score is scaled between 0 and 99 and 
can be regarded as a confidence value of the matcher. 
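
Two pieces of this procedure, the AAD feature of a sector and the minimum distance 
over the stored rotated templates, can be sketched as follows. The function names 
and the toy values are hypothetical, and the final inversion and scaling of the 
distance to the 0 to 99 range is left out because its exact form is not given above. 

```python
import math


def aad(gray_values):
    """Average absolute deviation from the mean of the gray values in one sector."""
    mean = sum(gray_values) / len(gray_values)
    return sum(abs(g - mean) for g in gray_values) / len(gray_values)


def matching_distance(input_code, rotated_templates):
    """Minimum Euclidean distance between the input code and the stored templates."""
    return min(math.dist(input_code, t) for t in rotated_templates)


# Toy sector and toy two-element feature vectors (real FingerCodes are much longer).
print(aad([1, 2, 3, 6]))
print(matching_distance((0.0, 0.0), [(3.0, 4.0), (1.0, 0.0), (6.0, 8.0)]))
```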



5 Experimental Results 

Fingerprint images were collected in our laboratory from 167 subjects using an 
optical sensor manufactured by Digital Biometrics, Inc. (image size = 508 x 480, 
resolution = 500 dpi). A single impression each of the right index, right middle, 
left index, and left middle fingers for each subject was taken in that order. This 
process was then repeated to acquire a second impression. The fingerprint images 
were collected again from the same subjects after an interval of six weeks in a 
similar fashion. Thus, we have four impressions for each of the four fingers of 
a subject. This resulted in a total of 2,672 (167 x 4 x 4) fingerprint images. 
We call this database MSUJDBI. A live feedback of the acquired image was 
provided and the subjects were guided in placing their fingers in the center of the 
sensor in an upright position. A total of 100 images (about 4% of the database) 
was removed from the MSUJDBI because the filter-based fingerprint matching 
algorithm rejected these images due to failure in locating the center or due to 
a poor quality of the images. We matched all the remaining 2,572 fingerprint 
images with each other to obtain 3, 306, 306 ( ^572x2571 ^ matchings and called the 
matchings genuine only if the pair are different impressions of the same finger. 
Thus, we have a total of 3,298,834 (3,306,306 — 7,472) imposter and 7,472 
genuine matchings per matcher from this database. For the multiple matcher 
combination, we randomly selected half the imposter matching scores and half 
the genuine matching scores for training and the remaining samples for test. 
This process was repeated ten times to give ten different training sets and ten 
corresponding independent test sets. All performances will be reported in terms 
of ROC curves computed as an average from the ten ROC curves corresponding 
to the ten different training and test sets (two-fold cross validation repeated ten 
times). For the multiple impression and multiple finger combinations, the same 
database of 3,298,834 imposter and 7,472 genuine matchings computed using 
the Dynamic matcher was used. 
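
The counts quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the variable names are ours, and the number of genuine matchings (7,472) is taken from the text rather than derived.

```python
from math import comb

subjects, fingers, impressions = 167, 4, 4
total_images = subjects * fingers * impressions   # images collected
rejected = 100                                    # images rejected by the filter-based matcher
remaining = total_images - rejected               # images actually matched

# Every unordered pair of distinct images is matched once: n(n-1)/2.
all_matchings = comb(remaining, 2)

genuine = 7472                                    # reported in the text, not derived here
imposter = all_matchings - genuine

print(total_images, remaining, all_matchings, imposter)
```

Half of each class is then drawn at random for training and the rest kept for testing, and the draw is repeated ten times.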

The ROC curves computed from the test data for the four individual fingerprint 
matchers used in this study are shown in Figure 1. The class separation 
statistic computed from the training data was 1.88, 1.87, 1.85 and 1.76 for the 
algorithms Dynamic, String, Filter, and Hough, respectively, and is found to 

Fig. 1. Performance of individual fingerprint matchers. 



Table 1. Combining two fingerprint matchers. CS is the class separation statistic. CS 
and ρ are computed from the training data. Ranks by ROC and ranks by Δ ROC are 
computed from the independent test data. 

Combination        CS (rank)   Rank by ROC   ρ      Rank by Δ ROC
String + Filter    1.95 (1)    1             0.52   2
Dynamic + Filter   1.95 (1)    2             0.56   3
String + Dynamic   1.94 (3)    2             0.82   3
Hough + Dynamic    1.93 (4)    4             0.80   6
Hough + Filter     1.91 (4)    6             0.53   1
Hough + String     1.90 (6)    5             0.83   5

be highly correlated to the matching performance on the independent test set. 
Figure 1 shows that matchers Dynamic and String are comparable, Filter is 
better than Dynamic and String at high FARs and slightly worse at very low 
FARs, and matcher Hough ranks last. 

Fig. 2. Two-dimensional density estimates for the genuine and imposter classes for 
String + Filter combination. Genuine density was estimated using Parzen window (h = 
0.01) estimator and the imposter density was estimated using normalized histograms. 

First, we combine the matchers in pairs. To combine two fingerprint 
matchers, we estimate the two-dimensional genuine and imposter densities from 
the training data. The two-dimensional genuine density was computed using the 
Parzen density estimation method. The value of window width (h) was empirically 
determined to obtain a smooth density estimate and was set at 0.01. We 
used the same value of h for all the two-matcher combinations. As a comparison, 
the genuine density estimates obtained from the normalized histograms 
were extremely peaky due to unavailability of sufficient data (only about 3,780 
genuine matching scores were available in the training set to estimate a two-dimensional 
distribution in 10,000 (100 x 100) bins). However, for estimation 
of the two-dimensional imposter distribution, over 1.6 million matching scores 
were available. Hence, we estimated the two-dimensional imposter distribution 
by computing a normalized histogram using the following formula: 

$$p(X \mid w_{\mathrm{imposter}}) = \frac{1}{n} \sum_{i=1}^{n} \delta(X, X_i) \qquad (2)$$

where δ is the delta function that equals 1 if the raw matching score vectors X 
and X_i are equal, 0 otherwise. Here n is the number of imposter matchings from 
the training data. The computation time for the Parzen window density estimate 
depends on n and so it is considerably larger than that of the normalized histogram 
method for large n. The estimates of the two-dimensional genuine and imposter 
densities thus computed for the String + Filter combination are shown in Figure 2. 
The class separation statistic for all pairs of matcher combination is shown 
in the second column of Table 1; the number in parentheses is the predicted 
ranking of the combination performance based on CS. The actual ranking of 
performance obtained from the independent test set is listed in the third column 
(marked ROC, for ROC curves). As can be seen, the predicted ranking is very 
close to the actual rankings on independent test data. 
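
The two estimators contrasted above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the text does not name the Parzen kernel, so an isotropic two-dimensional Gaussian is assumed here; the function names and the toy score pairs are hypothetical, and the genuine scores are assumed to be rescaled to the unit interval so that h = 0.01 is meaningful.

```python
import math
from collections import Counter

def histogram_density(samples):
    """Normalized histogram: fraction of training vectors equal to each score vector."""
    counts = Counter(samples)
    n = len(samples)
    return {x: c / n for x, c in counts.items()}

def parzen_density(x, samples, h):
    """Parzen window estimate at x with an isotropic 2-D Gaussian kernel (assumed)."""
    norm = 1.0 / (2.0 * math.pi * h * h)
    total = 0.0
    for s in samples:  # one kernel evaluation per training sample, hence cost grows with n
        d2 = (x[0] - s[0]) ** 2 + (x[1] - s[1]) ** 2
        total += norm * math.exp(-d2 / (2.0 * h * h))
    return total / len(samples)

# Toy score pairs (score of matcher A, score of matcher B); the values are made up.
imposter_scores = [(3, 5), (3, 5), (4, 6), (10, 2)]
genuine_scores = [(0.80, 0.90), (0.82, 0.88), (0.75, 0.95)]

p_imposter = histogram_density(imposter_scores)
print(p_imposter[(3, 5)])
print(parzen_density((0.80, 0.90), genuine_scores, h=0.01))
```

The histogram needs a single pass over the training data and a table lookup per query, whereas the Parzen estimate sums over all n training samples for every query point, which is the cost difference the text refers to.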

The following observations can be made from the two-matcher combinations: 

- Classifier combination improvement is directly related to the "independence" 
  (lower values of ρ) of the classifiers. 
- Combining two weak classifiers results in a large performance improvement. 
- Combining two strong classifiers results in a small performance improvement. 
- The two individually best classifiers do not form the best pair. 
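
The "independence" referred to in the first observation is measured by the correlation coefficient between the scores of two matchers, as reported in Table 1. A minimal sketch of such a measurement follows; it assumes the Pearson coefficient is meant, and the score lists are made-up toy values.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy matching scores of three hypothetical matchers on the same four pairs.
matcher_a = [10, 20, 30, 40]
matcher_b = [12, 18, 33, 41]   # behaves much like matcher_a
matcher_c = [40, 10, 30, 20]   # behaves differently

print(pearson(matcher_a, matcher_b))
print(pearson(matcher_a, matcher_c))
```

A pair whose scores are weakly correlated carries more complementary information, which is why such pairs are expected to gain more from combination.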




Fig. 3. The performance of the best individual matcher Dynamic is compared with 
the various combinations. The String + Filter is the best two-matcher combination 
and String + Dynamic + Filter is the best overall combination. Note that addition 
of the classifier Hough to the combination String + Filter results in a degradation of 
the performance. 



Next, we combine the matchers in groups of three and then combine all 
four matchers together. The class separation statistic is maximum (1.97) for 
the String + Dynamic + Filter combination. From the tests conducted on the 
independent data set, we make the following observations (see Figure 3). 

- Adding a classifier may actually degrade the performance of classifier 
  combination. This degradation in performance is a consequence of lack of 
  independent information provided by the classifier being added and finite size of 
  the training and test database. 
- Classifier selection based on a "goodness" statistic is a promising approach. 
- Performance of combination is significantly better than the best individual 
  matcher. 

Among all the possible subsets of the four fingerprint matchers, the class 
separation statistic is maximum for the String + Dynamic + Filter combination. 
Hence, our feature selection scheme selects this subset for the final combination 
and rejects the matcher Hough. This is consistent with the nature of the Hough 
algorithm, which is basically the linear pairing step in algorithms String and 
Dynamic, without the capability of dealing with elastic distortions. Therefore, 
Hough does not provide "independent" information with respect to String and 
Dynamic. The performance of the various matcher combinations on an independent 
test set supports the prediction that String + Dynamic + Filter is the best 
combination. The proposed combination scheme either outperforms or matches 
the performance of the sum rule and outperforms the product rule in all the 
two-, three-, and four-matcher combinations because the proposed technique can 
produce a nonlinear complex decision boundary that is close to optimal. 

The performance of the combined system is more than 3% better than the 
best individual matcher. The matcher combination takes about 0.02 seconds on 
a Sun Ultra 1 in the test phase. In an authentication system, this increase in 
time will have almost no effect on the verification time and the overall matching 
time is still bounded by the slowest individual matcher. 

The performance improvement due to combination of two impressions of the 
same finger and the combination of two different fingers of the same person 
using the proposed strategy is 4% and 5%, respectively. The matcher Dynamic 
was used. The correlation coefficient between the two scores is 0.42 for two 
different impressions of the same finger and 0.68 for two different fingers of the 
same person; this correlation is directly related to the improvement in the 
performance of the combination. The CS for individual impressions is 1.84 and 
1.87, respectively, and for the combination is 1.95. The CS for individual fingers 
is 1.87 and 1.86, respectively, and for the combination is 1.98. Combination of 
two impressions of the same finger or two fingers of the same person using the 
proposed combination strategy is extremely fast. Therefore, the overall 
verification time is the same as that of the individual matcher Dynamic. 



6 Conclusions and Discussions 

We have presented a scheme for combining multiple matchers (classifiers) at 
the decision level in an optimal fashion. Our design emphasis is on classifier 
selection before arriving at the final combination. It was shown that one of the 
fingerprint matchers in the given pool of matchers is redundant and no performance 
improvement is achieved by utilizing this matcher. This matcher was identified 
and rejected by the matcher selection scheme. In case of a larger number of 
classifiers and relatively small training data, a classifier may actually degrade 
the performance when combined with other classifiers, and hence classifier se- 
lection is essential. We demonstrate that our combination scheme improves the 
performance of a fingerprint verification system by more than 3%. We also show 
that combining multiple instances of the same biometric or multiple units of the 
same biometric characteristics is a viable way to improve the verification system 
performance. We observe that independence among various classifiers is directly 
related to the improvement in performance of the combination. 
Abstract. We describe a multiple classifier system which incorporates 
an automatic self-configuration scheme based on genetic algorithms. Our 
main interest in this paper is focused on exploring the statistical proper- 
ties of the resulting multi-expert configurations. To this end we initially 
test the proposed system on a series of tasks of increasing difficulty drawn 
from the domain of character recognition. We then proceed to investi- 
gate the performance of our system not only in comparison to that of its 
constituent classifiers, but also in comparison to an independent set of 
individually optimised classifiers. Our results illustrate that significant 
gains can be obtained by integrating a genetic algorithm based opti- 
misation process into multi-classifier schemes both in the performance 
enhancement and in the reduction of its volatility, especially as the task 
domain becomes more complex. 



1 Introduction 

The comparative advantages offered by multiple classifier systems are, by now, 
well appreciated by the pattern recognition community [?]. Although a significant 
number of theoretical and empirical studies have been published to 
date, determining the optimal selection of constituent members for any par- 
ticular multi-expert scheme remains an open question. Additionally, taking into 
consideration the variety of task domains encountered and varying degrees of 
task complexity and information availability (as expressed by the training set 
sizes), optimal selection is not always straightforward. Normally, reconfigura- 
tion is needed, usually through laborious experimentation and intuition, each 
time the training set or the task domain is changed. Considerations about the 
diversity of the properties of different individual classifiers add further compli- 
cations to the reconfiguration process. For example, the generalisation capacity 
of some classifiers can lead to saturation as the training set size increases, result- 
ing in highly volatile performance. On the other hand, some classifiers require 
increased training set sizes to achieve acceptable levels of performance (a well 
known example in this category is the k-nearest neighbours classifier). Finally, 
the variability of performance for some types of classifier leads to a significant 
dependence on the variability (complexity) of the particular task domain, while 
others seem to be more versatile. 
In the light of these considerations it seems necessary to incorporate into a 
multiple classifier system a mechanism which will systematically attempt to opti- 
mise the structure every time the domain or the amount of available information 
changes. Such a mechanism should have the following properties: 

1. To be a global optimisation technique which has been shown to be successful 
in highly complex domains, such as the space of the possible configurations 
for a multiple classifier system given a pool of individual classifiers. 

2. To be generic in the sense that there will be no need for recoding each time 
the application domain or the training set size changes. 

3. To offer a straightforward way to express the problem of optimisation of the 
multiple classifier structure. 

Among the large number of optimisation processes available, genetic algorithms [?] 
present a highly desirable choice which successfully fulfils the above 
requirements. Furthermore, initial reports of the application of genetic algorithms 
to multi-classifier systems have shown promising results [?] and [?]. 

In this paper we present a multiple classifier system which incorporates an au- 
tomatic self-configuration scheme based on genetic algorithms. Our main interest 
is focused on exploring the statistical properties of the resulting multi-classifier 
configurations. To this end we initially test the proposed system on a series of 
tasks of increasing difficulty drawn from the domain of character recognition. 
We then proceed to investigate the performance of our system, initially, in com- 
parison to that of its constituent classifiers, and finally and most importantly, 
in comparison to an independent set of individually optimised classifiers. In the 
following we will briefly describe the fusion scheme, the individual classifiers, and 
the genetic algorithm used, and we will proceed to discuss further the findings 
of our experimental investigation. 



2 The Fusion Scheme 

In the present work we draw the candidate members of the multi-classifier scheme 
from a pool of 12 individual classifiers which belong to four different classes. We 
use a parallel combination strategy where the constituent classifiers provide crisp 
decisions and the final classification is performed by a majority voting scheme 
where ties are broken arbitrarily. Diversity of classifiers is ensured by the use of 
either different training parameters or different feature sets, as can be seen in 
the following description. The classes of experts chosen are the following: 

Binary Weighted Scheme (BWS): This is a type of classifier which is based 
on simple n-tuple sampling [?]. Here, memory elements calculate a Boolean 
function based on the sampled n-tuples. The decisive training parameter in 
this case is the size n of the n-tuple used, which defines the quality and 
capacity of the resulting classifier. In our experiments we used sizes of 5 
(BWS1), 6 (BWS2), and 7 (BWS3). 

Frequency Weighted Scheme (FWS): This is similar to the previous class 
in its structure, but, in this case, the memory elements calculate the relative 
frequency of occurrence of the sampled features [?]. The important parameter 
is again the size of the n-tuple used. We again used sizes of 5 (FWS1), 
6 (FWS2), and 7 (FWS3). 

Moment-based Pattern Classifier (MPC): This is a statistical classifier 
which explores the possible cluster formation with respect to a distance 
measure calculated on the nth order mathematical moments derived from 
the binarised patterns [?]. Crucial parameters are the order of the mathematical 
moments and the distance metric used. Our experiments use 7th 
order moments and a Euclidean metric (MOM1), 6th order and Mahalanobis 
metric (MOM2), and 7th order and Mahalanobis metric (MOM3). 

k-Nearest Neighbours Classifier (KNN): This algorithm is based on initially 
determining the k instances from the training samples that are closest 
in a Euclidean n-dimensional space to the pattern to be classified. The 
classification decision is chosen as the class to which the majority of these nearest 
neighbours belong [?]. The number of neighbours considered and the features 
sampled form the set of important training parameters for this class. 
In this work we used the 5th order mathematical moments of the binarised 
images as the feature space and the three classifiers obtained had 1 (KNN1), 
3 (KNN2), and 5 (KNN3) effective nearest neighbours. 
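
The fusion rule described at the start of this section (crisp decisions, majority voting, ties broken arbitrarily) can be sketched as follows. The function name and the labels are hypothetical, and a random pick among the tied classes is only one possible reading of "arbitrarily".

```python
import random
from collections import Counter

def majority_vote(decisions, rng=random):
    """Return the most frequent crisp decision; ties are broken by a random pick."""
    counts = Counter(decisions)
    top = max(counts.values())
    tied = [label for label, c in counts.items() if c == top]
    return rng.choice(tied)

# Decisions of three participating classifiers for one pattern.
print(majority_vote(["A", "A", "B"]))
```

Only the classifiers selected for a given configuration would contribute a vote; the others are simply left out of the `decisions` list.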

3 The Set of Optimised Experts 

While it is common in the literature to compare the performance of a multiple 
classifier scheme to that of its constituent classifiers, we provide an additional test 
to examine the performance enhancement provided by our scheme. We used a set 
of individually trained classifiers which do not participate in the aforementioned 
pool. These have been optimised experimentally with respect to their effective 
parameters and the optimal training set size. The optimal performances achieved 
in each of the different task domains used in our experiments are presented in 
Table 5. This set of classifiers consists of two groups. To the first group belong 
classifiers from classes included in our pool of candidates for the combination 
scheme. These are a Frequency Weighted Scheme classifier (FWS) with an n-tuple 
size of 12 [?], and a Moment-based Pattern Classifier which uses up to 
7th order mathematical moments of the patterns (indicated as MAXL in the 
Table, because it uses a Maximum Likelihood-based distance measure) [?]. It 
should be noted here that these were included to provide a basis for comparisons 
with optimised versions of the classifiers used in the combination scheme. To 
the second group belong classifiers which come from classes often reported to 
provide excellent performance in the task domain of character recognition and 
they consequently present a valid benchmark for the performance obtained by 
our scheme. These classifiers are the following: 

Hidden Markov Model Classifier (HMM): This is a statistical classifier 
based on the simulation of the evolution of the states of a Markov Chain 
[?]. Its optimisation parameters include the number of states used and the 
number of symbols observed in each state. For the results presented we used 
12 to 16 states with 16 symbols per state. 

Scanning n-tuple Classifier (SNT): This is an enhancement of the basic 
n-tuple based class which works on the contour shape of the image represented 
using a chain coding scheme [?]. Again the critical parameters are the size 
of the n-tuple and the number of n-tuples used. Here we used five 5-tuples. 

Multilayer Perceptron (MLP): This is the well-known Artificial Neural 
Network architecture trained using Backpropagation Learning [?]. The essential 
parameter in this case is the number of hidden units. The MLP classifier we 
used had 40 hidden units. 

Moving Window Classifier (MWC): This is again an n-tuple based 
scheme which utilizes the idea of a window scanning the binarised image 
to provide partial classification indexes which are finally combined to obtain 
the overall classification decision [?]. In this case the critical parameters 
include the n-tuple size and the size of the moving window. Our results are 
based on an MWC using a 21x13 (pixels) window and n-tuples with size 12. 

4 Optimisation of the Configuration 

In genetic algorithms the solution to an optimisation problem is achieved as 
follows [?]: First, an initial set of possible solutions (population) to the problem, 
encoded as bit strings (called chromosomes), is generated, usually randomly. 
Then, successive populations of solutions are constructed by applying to the 
previous population a number of transformations (called genetic operations), 
depending on their quality (called fitness of the chromosomes) as solutions to 
the particular problem in hand. Finally, this iterative process is terminated when 
a prespecified criterion (e.g. a level of fitness or a number of iterations) is met. 
The most commonly used genetic operators are: a) "selection", which determines 
which chromosomes will be selected to reproduce, b) "crossover", which defines 
how the children will be created from the parents (i.e. which parts of the parent 
chromosomes will be used in the children's chromosomes), and c) "mutation", which 
attempts to infuse diversity into the population by randomly changing the value 
of randomly selected parts of chromosomes. The genetic operators we used are: 

Selection: We used the so-called classical Tournament method [?] to select 
chromosomes for reproduction. 

Crossover: We used a two-point crossover operator. 

Mutation: This was applied in conjunction with the crossover operator. 

The choice of a relatively simple genetic algorithm was made in order to facilitate 
better interpretability and clarity of our results. 

The search in the space of possible multi-classifier system configurations is 
naturally encoded in our case as a 12-bit binary string, where every bit represents 
an individual classifier with the value of 1 indicating participation in the 
combination scheme and 0 the opposite event. An exhaustive search in the set 



Genetic Algorithms for Multi-classifier System Configuration 103 



of all possible combinations involves excessively high computational time and 
load. The fitness function is, also, naturally defined as the recognition rate on 
an unknown evaluation set, as follows: 

Fitness = (Correctly classified patterns) / (All tested patterns). 

It should be noted here that performing the optimisation based on the recogni- 
tion rate over the training set, instead of an unknown evaluation set, although 
a perfectly valid procedure, would not provide a fitness measure which would 
reflect the property we are interested in, namely the generalisation ability of the 
resulting classifier system. 
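
The encoding and the fitness measure above can be sketched as follows. This is a minimal illustration, not the authors' code: the layout of `decisions`, the toy data, the deterministic tie-break, and the choice to give an empty configuration a fitness of zero are all assumptions.

```python
from collections import Counter

def fitness(chromosome, decisions, labels):
    """Recognition rate on the evaluation set of the majority vote of the
    classifiers whose bit is 1.

    chromosome   : sequence of 0/1 flags, one per classifier in the pool
    decisions[k] : labels assigned by classifier k to each evaluation pattern
    labels       : true labels of the evaluation patterns
    """
    active = [k for k, bit in enumerate(chromosome) if bit]
    if not active:
        return 0.0  # assumption: an empty configuration classifies nothing
    correct = 0
    for j, truth in enumerate(labels):
        votes = Counter(decisions[k][j] for k in active)
        winner = votes.most_common(1)[0][0]  # ties fall to the first label seen
        if winner == truth:
            correct += 1
    return correct / len(labels)

# Toy pool of three classifiers evaluated on four patterns.
labels = ["A", "B", "C", "A"]
decisions = [
    ["A", "B", "C", "B"],
    ["A", "B", "A", "A"],
    ["B", "B", "C", "A"],
]
print(fitness([1, 1, 1], decisions, labels))
print(fitness([1, 0, 0], decisions, labels))
```

In the toy pool each classifier alone is right on three of four patterns, while the three-member vote is right on all four, which is the kind of gain the search is meant to find.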

5 Experiments and Discussion 

The two databases we used reflect our main aim in this work which is to explore 
the properties of the genetic algorithm generated multi-expert schemes in a range 
of tasks with varying complexity and level of available information. Each of the 
databases consists of 34 classes of pre-segmented characters (numerals 0-9, and 
upper case letters A-Z, without differentiation between the pairs 0/O and 1/I). 
The first database (D1) corresponds to machine-printed characters extracted 
from post codes on envelopes in the UK mail. The second (D2) corresponds to 
handwritten characters. In each database every class has 300 samples provided 
at a resolution of 16x24 pixels. 

The experimental procedure adopted was as follows: 

1. The samples in each class for each database were randomly divided into three 
disjoint sets to form a training set, an evaluation set, and a testing set. While 
the sizes of the evaluation and test sets were kept constant at 50 samples 
to enable comparisons, there were three sizes of training sets used (50, 100, 
and 200) in order to test for the effect of providing additional information. 

2. The individual classifiers were trained. 

3. The genetic algorithm was applied to the pool of trained classifiers. The 
population had a size of ten configurations and the genetic algorithm ran 
for 5 generations with a probability of crossover 0.85, and a mutation prob- 
ability of 0.08. The two best individual configurations (chromosomes) were 
always copied to the next population to ensure that no deterioration in the 
performance will occur. The final configuration obtained is labelled Genetic 
in the Tables presented below. 

4. Two alternative configuration schemes were constructed for comparisons, the 
first consisting of all the classifiers available (denoted Comb1 in our Tables) 
and the second consisting only of the individual classifiers that achieved a 
recognition rate above 50% on the evaluation set (indicated as Comb2). 

5. Finally, all resulting solutions were tested on the disjoint test set. 
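
Step 3 of the procedure can be sketched as a small genetic algorithm using the stated settings (population of ten, five generations, crossover probability 0.85, mutation probability 0.08, two elite chromosomes). Several details are not given in the text and are assumptions here: a tournament of size two, mutation applied independently to every bit, and the toy fitness function used in the usage line.

```python
import random

def run_ga(fitness, n_bits=12, pop_size=10, generations=5,
           p_cross=0.85, p_mut=0.08, n_elite=2, seed=0):
    """Search for a high-fitness bit string; returns the best final chromosome."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]

    def tournament():
        a, b = rng.sample(pop, 2)            # tournament of size 2 (assumed)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        nxt = [c[:] for c in ranked[:n_elite]]   # elitism: copy the two best unchanged
        while len(nxt) < pop_size:
            p1, p2 = tournament(), tournament()
            c1, c2 = p1[:], p2[:]
            if rng.random() < p_cross:           # two-point crossover
                i, j = sorted(rng.sample(range(1, n_bits), 2))
                c1[i:j], c2[i:j] = p2[i:j], p1[i:j]
            for child in (c1, c2):
                for k in range(n_bits):          # per-bit mutation (assumed)
                    if rng.random() < p_mut:
                        child[k] ^= 1
                if len(nxt) < pop_size:
                    nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

# Toy fitness: fraction of bits set. A real run would score each configuration
# by its recognition rate on the evaluation set.
toy = lambda c: sum(c) / len(c)
best = run_ga(toy)
print(best, toy(best))
```

Because the two best chromosomes are carried over unchanged, the best fitness in the population cannot decrease from one generation to the next.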

The above process was repeated 10 times with different random starting 
points in each case, for each one of the different training set sizes, and for each 
of the databases. The same experiments were repeated for both databases first 
using only a 10 class problem (the numerals) and then the whole of the available 
Table 1. Recognition rate statistics (%) in Database D1 (10 classes). 

                     Training set size
             n = 50            n = 100           n = 200
Classifier   Mean     Std      Mean     Std      Mean     Std
BWS1         98.52    0.6941   98.86    0.3134   98.74    0.3658
BWS2         98.42    0.6070   99.04    0.5060   99.02    0.3938
BWS3        *98.66    0.4904  *99.14    0.4006  *99.14    0.2836
FWS1         94.26    1.2438   95.36    1.0741   94.84    0.7763
FWS2         94.72    1.0881   95.20    1.1035   94.86    0.8996
FWS3         95.08    1.0963   95.84    1.1108   95.30    0.7846
MOM1         57.66    2.8814   57.52    1.7943   56.82    1.8606
MOM2         96.18    0.7685   96.84    0.7531   97.30    0.7958
MOM3         95.54    0.9383   96.82    0.7743   97.48    0.6613
KNN1         83.16    2.2426   86.94    1.2580   87.92    1.3506
KNN2         84.06    2.4295   87.72    1.2479   88.44    1.4261
KNN3         85.12    2.6553   88.30    1.3004   89.16    0.9652
Comb1        98.40    0.7424   98.92    0.5594   98.66    0.3893
Comb2        98.40    0.7424   98.92    0.5594   98.66    0.3893
Genetic      98.60    0.8273   99.46    0.1897   99.64    0.2633


34 classes (digits and upper case letters) to form a sequence of increasingly 
difficult problems. Our results represent the averaged performance of these 10 
runs in each of these cases (indicated as Mean in Tables 1-4). To examine the 
variability of these performances with respect to different training set sizes and 
on different task domains we calculated, also, the standard deviations based on 
these 10 runs (indicated as Std in the Tables). 
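
The Mean and Std entries can be reproduced from the ten per-run recognition rates as sketched below. The rates are made-up values, and the text does not say whether the sample or the population standard deviation was reported, so both are shown.

```python
from statistics import mean, pstdev, stdev

# Hypothetical recognition rates (%) of one configuration over the 10 runs.
rates = [98.2, 98.8, 99.0, 98.6, 98.4, 98.8, 99.2, 98.6, 98.4, 99.0]

mean_rate = mean(rates)
sample_std = stdev(rates)        # divides by n - 1
population_std = pstdev(rates)   # divides by n

print(mean_rate, sample_std, population_std)
```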

An initial inspection of the Mean recognition rates presented reveals that 
the configurations found by the genetic algorithm-based optimisation (indicated as 
Genetic) are consistently better than the best individual classifier in the pool 
(indicated by * in the Tables). Also, in every case, these configurations outperform 
the two alternative combination schemes proposed (denoted Comb1 and 
Comb2 in the Tables). Additional examination of the same Tables shows that 
in most cases the variability of the performance of the configurations proposed 
by the genetic algorithm is smaller than the variability of the performances of the 
individual classifiers and the alternative combination schemes examined here. 

We can now examine the properties of the genetic algorithm-generated solutions 
as more information becomes available (as expressed by the training set sizes) 
for each of the task domains (machine-printed 
to handwritten character recognition). Clearly, for the easiest tasks (database 
D1, Tables 1 and 3), as the training set size increases the variability of the 
performances decreases in most of the cases. However, the same is not true for the 
most difficult tasks (database D2, Tables 2 and 4), where the variability inherent in 
the data sets is higher. 

Two important observations can be made here. First, some of the individual 
classifiers show signs of saturation in their generalisation capacity, presenting 
Table 2. Recognition rate statistics (%) in Database D2 (10 classes). 

                     Training set size
             n = 50            n = 100           n = 200
Classifier   Mean     Std      Mean     Std      Mean     Std
BWS1         85.86    1.3201   84.04    1.3978   80.66    4.3595
BWS2         87.88    1.8861   87.18    1.3315   86.14    2.7746
BWS3        *87.98    1.4093  *89.06    0.7058   89.26    2.6399
FWS1         79.60    1.9956   79.64    1.6406   81.20    1.3367
FWS2         80.38    2.3953   80.16    1.4010   81.70    1.5384
FWS3         79.58    1.8390   80.60    1.3728   82.74    0.8435
MOM1         44.52    1.7106   43.70    2.2652   44.44    1.9039
MOM2         81.20    2.0000   84.54    1.7815   87.76    1.6487
MOM3         83.08    1.1555   86.58    1.8023  *89.84    1.9202
KNN1         59.14    2.0807   63.98    2.7397   72.66    6.7192
KNN2         60.44    2.0887   65.98    2.5094   74.08    6.0282
KNN3         62.82    2.1301   67.40    2.5820   75.56    5.2492
Comb1        89.54    1.3167   89.68    1.2621   92.68    2.8146
Comb2        90.00    1.4996   90.26    1.0458   93.06    2.8799
Genetic      91.00    1.4453   93.94    1.1664   96.60    0.8641


higher standard deviations as the training set size increases. Typical examples 
are the three BWS classifiers, which in contrast to the FWS classifiers, present a 
constant or even decreased mean recognition rate but gradually increasing stan- 
dard deviations. Second, some of the classifiers reveal their high dependence on 
the training set size and their sensitivity to the variability in the data set. 
Typical examples in this case are the KNN classifiers, for which the performance 
increases with the training set size but its variability increases as well. As 
a general observation we can see that all the combination schemes examined 
help to reduce the variability in performance as expressed by the standard de- 
viations. However, it is easily realised that additional benefit is gained, in terms 
of improving the stability of the observed performance, by the configurations 
produced by the genetic algorithm scheme, as we move to the more complex 
task domains (handwritten characters with 34 class problem). It is important, 
also, to note that in all our experiments the average number of classifiers 
chosen by the optimisation process to participate in the final solutions ranged from 
six (6) to eight (8). This suggests that, in addition to the improvements in 
performance, the genetic algorithm based optimisation can offer savings in the 
computational load required. 

Finally, we can move to the second principal question to be addressed, the 
comparative performance of the genetic algorithm generated solutions with re- 
spect to a set of independently optimised individual classifiers. The performances 
of the latter are presented in the top part of Table 5, while in the bottom part 
we present the minimum and maximum (indicated MINPOOL and MAXPOOL 
respectively) of the performances achieved by the classifiers participating in the 
pool for the configuration optimisation, in the corresponding cases (training set 
Table 3. Recognition rate statistics (%) in Database D1 (34 classes). 

                     Training set size
             n = 50            n = 100           n = 200
Classifier   Mean     Std      Mean     Std      Mean     Std
BWS1         96.73    2.0048   97.50    0.3825   97.73    0.5227
BWS2        *97.04    1.9933   98.04    0.2882   98.47    0.3620
BWS3         96.98    1.9212  *98.24    0.2958  *98.67    0.3236
FWS1         92.38    2.0258   93.11    0.7614   92.81    0.3344
FWS2         92.59    1.8017   93.31    0.8430   93.23    0.3892
FWS3         92.82    2.2611   93.41    0.4878   93.62    0.4216
MOM1         38.85    0.7966   39.32    1.0287   39.85    0.8448
MOM2         89.62    3.3977   93.45    1.3382   94.94    0.6031
MOM3         90.90    2.8450   94.68    1.4398   96.11    0.8181
KNN1         66.42    0.9642   70.88    1.9108   76.35    3.7481
KNN2         67.95    1.0895   72.30    1.7867   77.66    3.6160
KNN3         69.25    1.1210   73.88    1.7251   78.81    3.4442
Comb1        96.95    2.0733   97.77    0.4509   98.45    0.6438
Comb2        97.15    1.9927   97.96    0.4225   98.60    0.6551
Genetic      97.48    1.8923   98.82    0.4205   99.18    0.5735


size 200). It is easy to observe that the former outperform the latter, as in most 
cases the best performance of the classifiers in the pool (MAXPOOL) is worse 
than the worst obtained by the optimised group. In the top part of Table 6 
we present the improvement in performances achieved by the genetic algorithm- 
based configurations, expressed as differences between the average recognition 
rates of the genetic algorithm produced solutions and the best individual classi- 
fier in the pool (the negative signs indicate deterioration in performance). The 
bottom part of the same Table (row 4) presents the corresponding performance 
differences between the genetically produced solutions and the best classifier in 
the optimised group. It is not difficult to see that the performance enhance- 
ment achieved by the solutions generated genetically increases significantly as 
the available information increases but, most importantly, also as we move to- 
wards more complex task domains. 
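
The improvement figures for the 34-class task can be recomputed directly from the mean recognition rates in Tables 3 and 4. The following is a minimal sketch, assuming that "best individual classifier in the pool" means the highest mean in each column; only the three rows that contain a column maximum are copied here, and the variable and function names are illustrative rather than taken from the paper.

```python
# Mean recognition rates (%) for n = 50, 100, 200, copied from Tables 3 and 4.
# Only rows that hold a column maximum are listed; all other pool members are lower.
pool_d1 = {
    "BWS2": [97.04, 98.04, 98.47],
    "BWS3": [96.98, 98.24, 98.67],
    "MOM3": [90.90, 94.68, 96.11],
}
pool_d2 = {
    "BWS2": [73.74, 73.42, 71.25],
    "BWS3": [75.29, 75.85, 76.48],
    "MOM3": [70.38, 76.71, 79.55],
}
genetic_d1 = [97.48, 98.82, 99.18]
genetic_d2 = [81.67, 85.73, 91.38]


def improvements(genetic, pool):
    """Difference between the genetic solution and the best pool member, per training set size."""
    result = []
    for column, rate in enumerate(genetic):
        best = max(rates[column] for rates in pool.values())
        result.append(round(rate - best, 2))
    return result


gain_d1 = improvements(genetic_d1, pool_d1)
gain_d2 = improvements(genetic_d2, pool_d2)
print("D1:", gain_d1)
print("D2:", gain_d2)
```

For n = 50 and n = 100 the computed differences (0.44 and 0.58 on D1, 6.38 and 9.02 on D2) agree with the corresponding 34-class entries that survive in Table 6.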



6 Conclusions 

This paper has presented a multiple classifier system which incorporates an au- 
tomatic self-configuration scheme based on genetic algorithms, and has further 
investigated some of its statistical properties in a range of tasks of increasing 
difficulty drawn from the domain of character recognition. Our main questions in 
this study were the relative stability of the performance of the proposed scheme 
and the possible performance enhancement that can be achieved, not only with 
respect to the participating individual classifiers, but also with respect to a di- 
verse group of classifiers which were individually and independently optimised 
for the tasks in hand. Our findings strongly suggest, in compliance with, and in 
Table 4. Recognition rate statistics (%) in Database D2 (34 classes). 

Training set size      n = 50             n = 100            n = 200
                    Mean     Std       Mean     Std       Mean     Std
BWS1                70.81    2.1925    68.32    3.8472    63.15    7.0037
BWS2                73.74    1.7536    73.42    3.6810    71.25    7.2861
BWS3               *75.29    1.3351    75.85    3.4415    76.48    6.6236
FWS1                58.12    1.9307    58.67    1.3238    59.39    1.7851
FWS2                59.65    1.4496    59.57    1.8719    59.91    1.7291
FWS3                59.75    2.2004    60.14    2.1262    60.33    2.0437
MOM1                26.69    1.2419    26.55    0.8637    26.84    0.6899
MOM2                69.11    1.7130    74.24    2.2711    76.44    2.0000
MOM3                70.38    1.6733   *76.71    2.5400   *79.55    2.3024
KNN1                40.38    4.1742    44.43    6.1929    50.28   14.9791
KNN2                41.56    4.2913    45.84    6.1871    52.19   14.3573
KNN3                43.54    4.1489    47.72    5.6984    54.30   13.4194
Comb1               77.58    1.6930    79.98    3.8026    80.62    6.0651
Comb2               79.85    1.7335    81.84    2.8193    82.61    5.2548
Genetic             81.67    1.8372    85.73    1.4009    91.38    3.4990

addition to, previously reported results, that the comparative benefits that can 
be gained by integrating a genetic algorithm-based optimisation scheme into the 
multiple classifier system are significant and increase as the available information 
on a specific problem domain increases. The most important observation, 
however, is that the benefits of this approach to classifier design increase 
substantially as task complexity increases. 
Table 5. Average performance (%) of the optimised individual classifiers and the 
minimum and maximum performance of the classifiers in the pool of candidates. 

              10 classes        34 classes
              D1       D2       D1       D2
FWS           99.1     91.1     98.5     80.0
HMM           97.9     89.1
SNT           99.5    *95.8

80.0     81.1
MAXPOOL
Table 6. Improvements in recognition rate (%) 

Training      10 classes        34 classes
set size      D1       D2       D1       D2
                       3.02     0.44     6.38
n = 100       0.32     4.88     0.58     9.02
n = 200       0.50

-0.02
Abstract. In this paper, we present a combined classification approach 
called the ‘virtual test sample method’. In contrast to classifier combination, 
where the outputs of a number of classifiers are used to come to 
a combined decision for a given observation, we use multiple instances 
generated from the original observation and a single classifier to compute 
a combined decision. In our experiments, the virtual test sample method 
is used to improve the performance of a statistical classifier based on 
Gaussian mixture densities. We show that this approach has some desirable 
theoretical properties and performs very well, especially when combined 
with the use of invariant distance measures. In the experiments 
conducted throughout this work, we obtained an excellent error rate of 
2.2% on the original US Postal Service task. 



1 Introduction 

In this paper, we present a combined classification approach called the ‘virtual 
test sample method’ (VTS), which is based on the idea of using a single classifier 
to classify a set of observations which are known to belong to the same class. This 
approach is somewhat contrary to the common idea of classifier combination, 
where the outputs of different classifiers are combined to come to a final decision 
for a given observation. In our approach, a number of instances is created from 
the original observation using prior knowledge about the classification task. For 
example, in handwritten digit recognition, invariance to image shifts and other 
affine transformations plays an important role. Thus, VTS can be considered a 
counterpart of the common creation of virtual training data (‘perturbation of 
the training data’). In the experiments, it is used to improve the performance of 
a Bayesian classifier based on Gaussian mixture densities. We show that using 
VTS not only yields state-of-the-art results on the well known US Postal Service 
handwritten digit recognition task (USPS), but that it also has some desirable 
theoretical properties. 

In the next section, we describe the US Postal Service database used in our 
experiments and present some state-of-the-art results that have been reported on this 
database in recent years. In Section 3, we briefly discuss the idea of classifier 
combination and one particular classifier combination scheme, namely the very 
popular sum rule. In Section 4, the VTS method is presented and its theoretical 
properties are discussed. In Section 5, the statistical classifier we used in our 
experiments (in combination with VTS) is described. In this context, we will 
also discuss possibilities to incorporate invariances into the classifier which go 
beyond the use of virtual test samples. This is done by creating virtual training 
data and by using an invariant distance measure called tangent distance, which 
was proposed by Simard in 1993. After presenting some experimental results 
in Section 6, the paper ends by drawing some conclusions and giving an 
outlook on future work in Section 7. 

2 The US Postal Service Task Feature Analysis 

The USPS database (ftp://ftp.kyb.tuebingen.mpg.de/pub/bs/data/) is a 
well known handwritten digit recognition task. It contains 7,291 training 
examples and 2,007 test examples. The digits are isolated and represented as 
grayscale images of 16x16 pixels (see Figure 1). Making use of ‘appearance 
based pattern recognition’, we interpret each pixel as a feature in our 
experiments, resulting in 256-dimensional feature vectors. Because of this rather 
high-dimensional feature space, we optionally apply a linear discriminant analysis 
(LDA) for feature reduction. As the maximum number of features that can 
be extracted by applying the LDA to a K-class problem is K - 1, we create four 
pseudoclasses for each USPS digit class by training a mixture with four densities 
using the algorithms described in Section 5. With 10 digit classes this gives 40 
pseudoclasses, so the resulting feature vectors are 39-dimensional. One of the 
advantages of USPS is that many recognition results have been reported by 
various research groups in recent years. 
Because of that, a meaningful comparison of the different classifiers is possible, 
with some results given in Tab. 1. Error rates marked with an asterisk were 
obtained using a modified USPS training set, which was extended by adding 
2,418 machine printed digits, so that comparability is restricted. 

3 The Idea of Classifier Combination 

The idea of classifier combination is the following: Given a particular pattern 
recognition problem, the goal is usually to implement a system which achieves the 
best possible recognition results on unseen data. Thus, in many cases, a variety 
of pattern recognition approaches is evaluated and the one performing best is 
Fig. 1. Example images taken from USPS. 

Table 1. Results reported on USPS. 

Author                    Method                                  Error [%]
Simard+ 1993              Human Performance                         2.5
Vapnik 1995               Decision Tree C4.5                       16.2
Freund & Schapire 1996    AdaBoost & Nearest Neighbour             *6.4
Simard+ 1998              Five-Layer Neural Net                     4.2
Schölkopf 1997            Support Vectors                           4.0
Schölkopf+ 1998           Invariant Support Vectors                 3.0
Drucker+ 1993             Boosted Neural Net                       *2.6
Simard+ 1993              Tangent Distance & Nearest Neighbour     *2.5
This work:                Gaussian Mixtures, Invariances            2.2

*: 2418 machine printed digits were added to the training set 



chosen to solve the task. Unfortunately, in that approach, all other systems 
that have been developed are useless. In contrast, the idea of classifier 
combination is to use all classifiers $C_m$, $m = 1, \dots, M$, for classification and to 
come to a final decision by combining the outputs in a suitable way (cp. Fig. 2). 
In recent years, many combination approaches have been considered, among 
them the product rule, the sum rule, and the median rule, where in some cases 
the ‘vote’ of a classifier is weighted according to its performance on the training 
set (i.e. boosting methods). Note that for such combination rules to be 
meaningful, the outputs of the classifiers must be normalized. Thus, we assume 
that, given the observation $x \in \mathbb{R}^D$, each classifier $C_m$ computes posterior 
probabilities $p_m(k|x)$ for each class $k = 1, \dots, K$, which are normalized in the 
sense that $\sum_{k=1}^{K} p_m(k|x) = 1$. It should be noted that, for instance, the outputs 
of an artificial neural net approximate such posterior probabilities, assuming 
that a sufficient amount of training data is available. Thus, normalization comes 
for free in many applications. 

For a single classifier, the Bayesian decision rule can be used for classification: 

$$x \mapsto r(x) = \arg\max_k \{ p(k|x) \} = \arg\max_k \left\{ \frac{p(k) \cdot p(x|k)}{\sum_{k'=1}^{K} p(k') \cdot p(x|k')} \right\}, \qquad (1)$$

where $p(k)$ is the prior probability of class $k$ and $p(x|k)$ is the class conditional 
probability for the observation $x$ given class $k$. Note that the denominator of 
Eq. (1) is independent of $k$ and can be neglected for classification purposes. If 
different classifiers $C_m$ are available (computing posterior probabilities $p_m(k|x)$), 
the final decision can, for instance, be obtained using the sum rule 

$$x \mapsto r(x) = \arg\max_k \left\{ \sum_{m=1}^{M} p_m(k|x) \right\}. \qquad (2)$$

Although Eq. (2) is widely accepted to yield state-of-the-art results in many 
applications, Kittler assumed in his derivation of the sum rule that the posterior 
probabilities $p_m(k|x)$ computed by the different classifiers do not differ 
much from the prior probabilities $p(k)$. In other words, the derivation of 
the sum rule for classifier combination is based on the strong assumption that 
the features extracted from the data contain no discriminatory information. 
Interestingly, Kittler also observed that the good performance of the sum rule 
could possibly be explained by its error tolerance: Using the sum rule, errors in 
estimating the real (and therefore usually unknown) posterior probabilities are 
dampened, while for instance in the case of the product rule, these estimation 
errors are amplified. 

Fig. 2. Classifier combination (left) vs. the virtual test sample method (right). 
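
The difference in error tolerance between the sum rule and the product rule can be illustrated with a small example. The sketch below assumes that each classifier already returns normalized posteriors; the function names and the three posterior vectors are invented for illustration and are not taken from the experiments.

```python
import math


def sum_rule(posteriors):
    """posteriors[m][k] = p_m(k|x); returns argmax_k of the sum over classifiers m."""
    n_classes = len(posteriors[0])
    scores = [sum(p[k] for p in posteriors) for k in range(n_classes)]
    return max(range(n_classes), key=scores.__getitem__)


def product_rule(posteriors):
    """Same input; returns argmax_k of the product over classifiers m."""
    n_classes = len(posteriors[0])
    scores = [math.prod(p[k] for p in posteriors) for k in range(n_classes)]
    return max(range(n_classes), key=scores.__getitem__)


# Hypothetical outputs of M = 3 classifiers for a two-class problem.
# The third classifier badly underestimates the posterior of class 0.
outputs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.01, 0.99],
]

print("sum rule decides class", sum_rule(outputs))
print("product rule decides class", product_rule(outputs))
```

In this made-up case a single poor estimate flips the product rule decision, while the sum rule still follows the two agreeing classifiers.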

If no set of classifiers exists for combination, techniques like ‘bagging’ 
or ‘boosting’ can be used, which generate a variety of classifiers by training on 
different subsets (bagging) or differently weighted versions (boosting) of the 
training data. Here, it is assumed that the classifiers are ‘unstable’, i.e. 
that modifications of the training set have a significant impact on the resulting 
classifier. Otherwise, combination of the resulting classifiers would be pointless. 

4 The Virtual Test Sample Method 

The basic idea of the virtual test sample method is to create a number of ‘virtual 
test samples’ starting from the original observation, to classify each of these 
separately using a single classifier C and to suitably combine these decisions to 
a final decision for the original observation (cp. Fig. 2). In handwritten digit 
recognition, invariance to affine transformations is usually desired, but generally 
speaking all transformations respecting class membership can be considered 
here. Thus, given the observation $x$, we can create virtual test samples 
$x(\alpha) = t(x, \alpha)$, $\alpha \in \mathcal{M}$, with $M = |\mathcal{M}|$, where $t(x, \alpha)$ is a transformation with 
parameters $\alpha \in \mathcal{M}$. In the experiments, ±1 pixel shifts were applied, i.e. $M = 9$ 
(eight shifts and the original image). As an image cannot be shifted into different 
directions at the same time, the resulting ‘events’ $x(\alpha)$, $\alpha \in \mathcal{M}$, can be regarded 
as being mutually exclusive and a final decision can be computed as follows: 

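$$\begin{aligned}
x \mapsto r(x) &= \arg\max_k \{ p(k|x) \} \\
&= \arg\max_k \left\{ \sum_{\alpha \in \mathcal{M}} p(k, \alpha|x) \right\} \\
&= \arg\max_k \left\{ \sum_{\alpha \in \mathcal{M}} p(\alpha|x) \cdot p(k|x, \alpha) \right\} \\
&= \arg\max_k \left\{ \sum_{\alpha \in \mathcal{M}} p(\alpha) \cdot p(k|x(\alpha)) \right\} \qquad (3)
\end{aligned}$$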



Here, the simultaneous occurrence of an observation $x$ and a parameter vector 
$\alpha \in \mathcal{M}$ is modeled by the virtual test sample $x(\alpha)$, i.e. by applying the respective 
transformation to the observation. Furthermore, $\alpha$ is assumed to be independent 
of $x$. Thus, to come to a final decision for the original observation, we only have 
to add the posterior probabilities $p(k|x(\alpha))$, weighted with the prior probabilities 
$p(\alpha)$ of the transformation parameters. In the experiments conducted throughout 
this work, unless stated otherwise, these transformation parameters are assumed 
to be uniformly distributed. Thus, the prior probabilities $p(\alpha)$ can be neglected 
for classification purposes and Eq. (3) reduces to 



$$r(x) = \arg\max_k \left\{ \sum_{\alpha \in \mathcal{M}} p(k|x(\alpha)) \right\}. \qquad (4)$$



The only assumption needed here is the mutual exclusiveness of the virtual test 
samples. As each of these is the result of applying a unique transformation to the 
given observation, this assumption seems reasonable. This ‘virtual test sample 
method’ has a number of desirable properties: 

Computational Complexity: 

The computational complexity of the VTS recognition step is the same as that 
of classifier combination. Yet, only one classifier has to be trained in the VTS 
training phase, which is especially important for statistical classifiers, where the 
training step is computationally expensive in many cases. 

Theoretical Basis: 

In contrast to the derivation of the sum rule in the framework of classifier 
combination, the VTS sum rule is straightforward to derive, and the assumption of 
mutual exclusiveness of the $x(\alpha)$ appears reasonable. 

Increased Transformation Tolerance/ Invariance: 

Obviously, invariance properties with respect to the transformations used for 
virtual test data creation are incorporated into the classifier. 

Ease of Implementation & Effectiveness: 

Assuming a suitable normalization of the classifier’s output, VTS is very sim- 
ple to embed into an existing classifier. Furthermore, using VTS significantly 
reduced the error rates in the experiments conducted throughout this work. For 
real-time applications, VTS is straightforward to parallelize (just like classifier 
combination), as it is inherently parallel. 
Applicable together with Classifier Combination: 

In principle, VTS and classifier combination can be used at the same time. Doing 
so, our best VTS result could in fact be slightly improved (cp. Section 7). 

Incorporation of Prior Knowledge about Transformation Probabilities: 

Finally, it is possible to incorporate prior knowledge into VTS classification via 
an appropriate choice of the probabilities $p(\alpha)$ (our model) or $p(\alpha|x)$. 
For instance, in a statistical framework such as the one presented in the next section, 
these probabilities could be learned from the training data. 
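
The decision rule of this section can be sketched in a few lines. The code below is an illustrative sketch, not the authors' implementation: it assumes images are given as lists of rows, that shifted-out pixels are filled with zeros, and that the classifier is available as a function returning normalized posteriors. The names `shift_image`, `virtual_test_samples` and `vts_decide` are hypothetical.

```python
from itertools import product


def shift_image(image, dy, dx, fill=0.0):
    """Shift a 2-D image (list of rows) by (dy, dx) pixels, padding with `fill`."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                shifted[ny][nx] = image[y][x]
    return shifted


def virtual_test_samples(image):
    """The M = 9 samples: all +-1 pixel shifts, with the unshifted image at index 4."""
    return [shift_image(image, dy, dx) for dy, dx in product((-1, 0, 1), repeat=2)]


def vts_decide(image, posterior, priors=None):
    """Add the posteriors of all virtual test samples, weighted by the priors.

    `posterior(sample)` must return a list of class posteriors for one sample.
    With priors=None the transformation parameters are treated as uniform.
    Returns (decided class index, list of combined scores).
    """
    samples = virtual_test_samples(image)
    if priors is None:
        priors = [1.0 / len(samples)] * len(samples)
    scores = None
    for weight, sample in zip(priors, samples):
        post = posterior(sample)
        if scores is None:
            scores = [0.0] * len(post)
        for k, p in enumerate(post):
            scores[k] += weight * p
    decision = max(range(len(scores)), key=scores.__getitem__)
    return decision, scores
```

Any existing classifier with normalized outputs can be passed as `posterior`, which reflects the point made above that only one classifier has to be trained.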



5 The Statistical Classifier 



In this section, we describe the statistical classifier which we used in combination 
with the VTS method in our experiments. To classify an observation 
$x \in \mathbb{R}^D$, we use the Bayesian decision rule as given in Eq. (1), which is known 
to minimize the number of expected classification errors in the case that the 
true distributions $p(k)$ and $p(x|k)$ are known. Naturally, as these are unknown 
in most practical applications, we have to choose models for them and learn the 
respective parameters using the training data. In the experiments, we choose 
$p(k) = \frac{1}{K}$, $k = 1, \dots, K$, and model the class conditional probabilities $p(x|k)$ 
using Gaussian mixture densities (GMD) (see Eq. (6)) or Gaussian 
kernel densities (GKD). In order to keep the number of free model parameters 
small (and thus allow for reliable parameter estimation), we make use of a 
globally pooled covariance matrix 



$$\Sigma = \sum_{k=1}^{K} \sum_{i=1}^{I_k} \frac{N_{ki}}{N} \cdot \Sigma_{ki}, \qquad (5)$$



where $\Sigma_{ki}$ is the covariance matrix of component density $i$ of class $k$ and $N_{ki}$ is 
the number of observations that are assigned to that particular density. Thus, 
we obtain the following expression for the class conditional probabilities: 

$$p(x|k) = \sum_{i=1}^{I_k} c_{ki} \cdot \mathcal{N}(x|\mu_{ki}, \Sigma), \qquad (6)$$



where $I_k$ is the number of component densities used to model class $k$, $c_{ki}$ are 
weight coefficients (with $c_{ki} > 0$ and $\sum_{i=1}^{I_k} c_{ki} = 1$, which is necessary to ensure 
that $p(x|k)$ is a probability density function) and $\mu_{ki}$ is the mean vector of 
component density $i$ of class $k$. Furthermore, we only use a diagonal covariance 
matrix, i.e. a variance vector. Note that this does not lead to a loss of information, 
since a Gaussian mixture of that form can still approximate any density function 
with arbitrary precision. Maximum likelihood parameter estimation is now done 
using the Expectation-Maximization algorithm. Concerning Gaussian kernel 
densities, it should be pointed out that these can be regarded as an extreme case of 
a Gaussian mixture, where each reference sample $x_n$ defines a Gaussian normal 
distribution $\mathcal{N}(x|x_n, \Sigma)$. 
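
As a rough illustration of this model, the sketch below evaluates a Gaussian mixture with a shared variance vector and turns the class conditional densities into posteriors under uniform class priors. It is a simplified sketch under stated assumptions: parameters are given rather than estimated (no Expectation-Maximization step), the pooling is applied to variance vectors only, and all function names are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance given as variance vector `var`."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - mi) ** 2 / v
        for xi, mi, v in zip(x, mean, var)
    )


def log_mixture(x, weights, means, var):
    """log p(x|k) for one class: log of sum_i c_ki * N(x | mu_ki, var), computed stably."""
    terms = [math.log(c) + log_gauss_diag(x, mu, var) for c, mu in zip(weights, means)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def posteriors(x, classes, var):
    """classes[k] = (weights, means); uniform priors p(k) = 1/K cancel in the ratio."""
    logs = [log_mixture(x, weights, means, var) for weights, means in classes]
    top = max(logs)
    unnormalized = [math.exp(value - top) for value in logs]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def pooled_variance(variances, counts):
    """Count-weighted average of per-component variance vectors (diagonal pooling)."""
    n = sum(counts)
    dim = len(variances[0])
    return [sum(c * v[d] for c, v in zip(counts, variances)) / n for d in range(dim)]
```

The output of `posteriors` is normalized, so it can be used directly wherever normalized classifier outputs are required.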
Note that the approach presented above is only invariant with respect to 
transformations that are present in the training data. In the following, we therefore 
briefly describe two possibilities to enhance the invariance properties of the 
statistical classifier that go beyond the usage of VTS. 

5.1 Creation of Virtual Training Data 

A typical drawback of statistical classifiers is their need for a large amount of 
training data, which is not always available. A common solution for this problem 
is the creation of virtual training data. Here, just like for the VTS method, ±1 
pixel shifts were chosen, resulting in 9 · 7,291 = 65,619 reference samples. 

5.2 Incorporating Invariant Distance Measures 

Another way to incorporate invariances is to use invariant probability density 
functions or, equivalently, invariant distance measures. Here, we choose tangent 
distance (TD), which proved to be especially effective for optical character 
recognition. Simard et al. observed that reasonably small transformations 
of certain objects (like digits) do not affect class membership. Simple distance 
measures like Euclidean or Mahalanobis distance do not account for this and are 
very sensitive to transformations like translations or rotations. When an image $x$ 
of size $I \times J$ is transformed (e.g. scaled and rotated) with a transformation $t(x, \alpha)$ 
which depends on $L$ parameters $\alpha \in \mathbb{R}^L$ (e.g. the scaling factor and the rotation 
angle), the set of all transformed images $\mathcal{M}_x = \{t(x, \alpha) : \alpha \in \mathbb{R}^L\} \subset \mathbb{R}^{I \times J}$ is 
a manifold of at most $L$ dimensions. The distance between two images can now 
be defined as the minimum distance between their respective manifolds, being 
truly invariant with respect to the $L$ transformations regarded. Unfortunately, 
computation of this distance is a hard optimization problem and the manifolds 
needed have no analytic expression in general. Therefore, small transformations 
of an image $x$ are approximated by a tangent subspace $\hat{\mathcal{M}}_x$ to the manifold $\mathcal{M}_x$ 
at the point $x$. Those transformations can be obtained by adding to $x$ a linear 
combination of the vectors $x_l$, $l = 1, \dots, L$, that span the tangent subspace. Thus, 
we obtain as a first-order approximation of $\mathcal{M}_x$: 

$$\hat{\mathcal{M}}_x = \left\{ x + \sum_{l=1}^{L} \alpha_l \cdot x_l : \alpha \in \mathbb{R}^L \right\}. \qquad (7)$$

Now, the single sided TD $D_T(x, \mu)$ between two images $x$ and $\mu$ is defined as 

$$D_T(x, \mu) = \min_{\alpha} \left\{ \Big\| x + \sum_{l=1}^{L} \alpha_l \cdot x_l - \mu \Big\|^2 \right\}. \qquad (8)$$
The tangent vectors $x_l$ can be computed using finite differences between the 
original image $x$ and a small transformation of $x$. Furthermore, a double 
sided TD can also be defined by approximating both $\mathcal{M}_x$ and $\mathcal{M}_\mu$. In the experiments, 
we computed seven tangent vectors for translations (2), rotation, scaling, axis 
deformations (2) and line thickness as proposed by Simard. Given that the 
tangent vectors are orthogonal, Eq. (8) can be solved efficiently by computing 

$$D_T(x, \mu) = \|x - \mu\|^2 - \sum_{l=1}^{L} \frac{\left[ (x - \mu)^T \cdot x_l \right]^2}{\|x_l\|^2}. \qquad (9)$$

Fig. 3. Prior probabilities $p(\alpha)$ chosen for image shifts in the experiments. For example, 
the prior probability for the original (i.e. unchanged) image is (4/16) = 0.25. 

The combination of TD with virtual data creation makes sense, as TD is only 
approximately invariant with respect to the transformations considered. Thus, 
creating virtual training data yields a better approximation of the original mani- 
fold, as the virtual training images lie exactly on it. 
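
For mutually orthogonal tangent vectors, the single sided tangent distance amounts to removing from the squared Euclidean distance the part of the difference vector that lies in the tangent subspace. The sketch below assumes exactly that closed form and works on flattened images given as plain lists; `shift_tangent` is a hypothetical helper that shows a finite-difference tangent for a one-pixel shift of a one-dimensional signal.

```python
def dot(u, v):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def tangent_distance_single_sided(x, mu, tangents):
    """Squared single sided tangent distance, valid for mutually orthogonal tangents.

    Starts from ||x - mu||^2 and subtracts the squared projection of (x - mu)
    onto each tangent vector. With an empty tangent list this is the squared
    Euclidean distance.
    """
    diff = [a - b for a, b in zip(x, mu)]
    distance = dot(diff, diff)
    for tangent in tangents:
        distance -= dot(diff, tangent) ** 2 / dot(tangent, tangent)
    return distance


def shift_tangent(signal):
    """Finite difference between a one-pixel shifted copy of `signal` and the original."""
    shifted = [0.0] + signal[:-1]
    return [s - r for s, r in zip(shifted, signal)]
```

If the tangent vectors are not orthogonal, they would first have to be orthogonalized (for example by a Gram-Schmidt step) before this formula applies.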

For the calculation of the Mahalanobis distance, the observation $x$ and the 
references $\mu_{ki}$ are replaced by the optimal tangent approximations $x(\alpha_{opt})$ and 
$\mu_{ki}(\alpha_{opt})$, respectively, in the TD experiments. When calculating single sided TD, 
the tangents are applied on the side of the references. Note that TD can also be 
used to compute a ‘tangent covariance matrix’, which is defined as the empirical 
covariance matrix of all possible tangent approximations of the references. 

6 Results 

The experiments were started by applying the statistical approach described in 
Section 5 to the high-dimensional USPS data, using different combinations of 
virtual training and test data. In Tab. 2, the notation ‘a-b’ indicates that we 
increased the number of training samples by a factor of a and that of the test 
samples by a factor of b. Thus, b=9 indicates the use of VTS. As can be seen, VTS 
significantly reduces the error rate on USPS from 8.0% to 6.6% (without virtual 
training data) and from 6.4% to 6.0% (with virtual training data). These 
error rates can be further reduced by applying an LDA. Thus, the best error rate 
decreases from 6.0% to 3.4%, which is mainly because parameter estimation is 
more reliable in this rather low-dimensional feature space. Note that in this case, 
applying VTS reduced the 9-1 error rate from 4.5% to 3.4%, a relative 
improvement of 24.4%. This error rate could be slightly improved to 3.3% by 
assuming a Gaussian distribution for the prior probabilities $p(\alpha)$, resulting in the 
template depicted in Figure 3. As a key experiment, we boosted the statistical 
classifier based on 39 LDA features using AdaBoost.M1 for M = 10. Indeed, 
we were able to reduce the 9-1 error rate from 4.5% to 4.2%, yet VTS, by 
reducing the error rate from 4.5% to 3.4%, significantly outperformed AdaBoost 
on this particular dataset. 
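
The relative improvements quoted in this paragraph follow from the error rates by simple arithmetic. The short sketch below reproduces them; the function name is illustrative and the numbers are the error rates stated in the text.

```python
def relative_improvement(before, after):
    """Relative reduction of an error rate, in percent of the original rate."""
    return 100.0 * (before - after) / before


# Error rates (%) quoted in the text for the LDA based classifier with 9-fold training data.
vts_gain = round(relative_improvement(4.5, 3.4), 1)        # VTS: 4.5% -> 3.4%
adaboost_gain = round(relative_improvement(4.5, 4.2), 1)   # AdaBoost.M1: 4.5% -> 4.2%
no_lda_gain = round(relative_improvement(8.0, 6.6), 1)     # VTS without LDA: 8.0% -> 6.6%

print(vts_gain, adaboost_gain, no_lda_gain)
```

The first value matches the 24.4% stated above; the other two are derived in the same way for comparison.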

In another experiment, we investigated the use of TD in combination with 
VTS. These experiments were performed in the high-dimensional feature space 
Table 2. USPS results for Mahalanobis/tangent distance, with/without LDA. 

Method                     Error rate [%]
                           1-1     1-9     9-1     9-9
GMD                        8.0     6.6     6.4     6.0
GMD, LDA                   6.7     5.9     4.5     3.4
GMD, tangent distance      3.9     3.6     3.4     2.9
GKD, tangent distance      3.0     2.6     2.5     2.4

(no LDA), as TD in its basic form is defined on images (although it can also 
be defined on arbitrary feature spaces, where the tangent vectors are learned 
from the data itself). Using single sided TD and about 1,000 normal distributions 
per class, the best error rate could be reduced from 3.4% for the LDA based 
statistical classifier to 2.9%. This error rate could 
be further reduced to 2.7% by using double sided TD. Replacing the mixture 
density approach by kernel densities, the VTS error rate was reduced to 2.4%. In 
these experiments, standard deviations were used instead of diagonal covariance 
matrices. Finally, by combining the outputs of five VTS based kernel density 
classifiers (using different norms in the distance calculations and different kinds 
of training data multiplication), the error rate could be further reduced to 2.2%. 
To make sure that these good results are not the result of overfitting, we also 
applied our best kernel density based USPS classifier (error rate 2.4%) to the 
well known MNIST task without further parameter tuning, obtaining a 
state-of-the-art result of 1.0% (1.3% without VTS). Although this result is not the 
best known on MNIST (the best error rate of 0.7% was reported by Drucker), 
it shows that the algorithms presented here generalize well. 

7 Conclusions & Outlook 

In this paper, we presented a combined classification approach called the ‘virtual 
test sample method’, which is based on using a single classifier in combination 
with artificially created test samples. Thus, it can be regarded as a counterpart 
to the creation of virtual training data, which is a common approach in pattern 
recognition. We showed that the proposed method is straightforward to justify 
and has some desirable properties, among them the possible incorporation of 
prior knowledge and the fact that only a single classifier has to be trained. In 
our experiments, the approach was used to improve the performance of a statis- 
tical classifier based on the use of Gaussian mixture densities in the context of 
the Bayesian decision rule. The results obtained on the well known US Postal 
Service task are state-of-the-art, especially when the virtual test sample method 
is combined with the incorporation of invariances into the classifier, which was 
done by using Simard’s tangent distance and resulted in an error rate of 2.2%. 
Finally, the approach seems to generalize well, as a state-of-the-art error rate 
of 1.0% was also obtained on the MNIST handwritten digit task. Besides developing 
better models for the probabilities $p(\alpha)$ or $p(\alpha|x)$ considered in 
the virtual test sample method, future work will also include investigating its 
effectiveness in other pattern recognition domains. 
Abstract. We present a learning algorithm for two-class pattern recognition. 
It is based on combining a large number of weak classifiers. The 
weak classifiers are produced independently and with diversity, and they are 
combined through a weighted average, weighted exponentially with respect 
to their apparent errors on the training data. Experimental results 
are also given. 



1 Introduction 

“Averaging” has been a useful technique for constructing ensembles. Well known 
examples are bagging, Bayesian averaging, and Stochastic Discrimination 
(SD). A recent paper also uses this technique. 

“Weak classifiers” generated by using the training data set are by now well 
established, and in the literature there is a considerable amount of research work using the 
idea of weak classifiers to form strong classifiers. In some of this work, weak 
classifiers are obtained by finite unions 
of rectangular regions in the feature space and they are combined through the 
average of the base random variables. In other work, weak classifiers are obtained by 
finite unions of rectangular regions and they are combined by majority vote. 
Elsewhere, weak classifiers are linear classifiers, generated by random selections of 
hyperplanes, and they are combined by majority vote. In yet another approach, multiple trees are 
constructed systematically by pseudorandomly selecting subsets of components 
of the feature vectors, and the combination is done through majority vote for 
fully split trees. The combined results from these methods are often 
remarkable. 

In this paper we present a binary classification learning algorithm that produces 
a number of weak classifiers from SD and combines them through a 
weighted average employed in earlier work. The motivation to use weak classifiers from 
SD is that they are computationally cheap and easy to obtain. We use this 
“averaging” method to combine weak classifiers simply because it has several 
appealing properties. First, predictions of final classifiers from this method are 
stable and reliable. Second, the method has the potential that the final classifier 
may significantly outperform the best among the set of weak classifiers. 
The work presented in this paper can be viewed as not only an application of 
weak classifiers from SD, but also an implementation of the carefully 
studied averaging algorithm. 

This paper is organized as follows. Section 2 describes the algorithm. Section 3 
presents some experimental results. Our conclusion is given in Section 4. 



2 The Algorithm 

In this section, we present our algorithm for binary classification. First we show 
how to produce weak classifiers. And then we show how to combine them to 
form a strong classifier. 



2.1 Weak Classifiers 

Consider the standard two-class pattern recognition problem. We have a train- 
ing set of independently and identically distributed observations: • • •, 

(x„,?/„), where x^ is a feature vector of dimension p, belonging to the feature 
space F, and G {1, 2} is a class label. Alternatively, we may write the training 
set as TR = TA 2 }, where TRk is a, given random sample from class k for 

k = 1,2. 

Based on the training data set $TR = \{TR_1, TR_2\}$, we can produce a sequence
of weak classifiers using rectangular regions ([?]). A rectangular region in
$\mathbb{R}^p$ is a set of points $(x_1, x_2, \ldots, x_p)$ such that
$a_i \leq x_i \leq b_i$ for $i = 1, \ldots, p$, where $a_i$ and $b_i$ are real
numbers. For simplicity, a rectangular region is denoted by
$\prod_{i=1}^{p} (a_i, b_i)$.

Suppose $\kappa$ is a fixed natural number and $\lambda$ ($\lambda \geq 1$), $\rho$
($0 < \rho < 1$), and $\beta$ ($0 < \beta < 1$) are fixed real numbers. Let $\Re_1$
be the smallest rectangular region containing $TR$. Let $\Re$ be the rectangular
region (in $\mathbb{R}^p$) such that $\Re$ and $\Re_1$ have the same center, $\Re$
is similar to $\Re_1$, and the “width” of $\Re$ along the $x_i$-axis is $\lambda$
times the corresponding width of $\Re_1$. Suppose that
$\Re = \prod_{i=1}^{p} (L_i, U_i)$. Inside $\Re$, one randomly chooses a training
feature vector $q = (q_1, \ldots, q_p)$ and numbers $l_i$ and $u_i$ such that
$L_i \leq l_i \leq q_i \leq u_i \leq U_i$ and $u_i - l_i \geq \rho (U_i - L_i)$
for $i = 1, \ldots, p$. Form a rectangular region $R = \prod_{i=1}^{p} (l_i, u_i)$.
For any subsets $T_a$ and $T_b$ of $F$, let $r(T_a, T_b)$ denote the ratio of the
number of common feature vectors in $T_a$ and $T_b$ to the number of feature
vectors in $T_b$. For example, if $T_b$ contains 5 feature vectors and $T_a$ and
$T_b$ have 3 feature vectors in common, then $r(T_a, T_b) = 3/5 = 0.6$. A set $S$
is a weak classifier if $S$ is a union of $\kappa$ rectangular regions $R$
constructed as above which satisfies $|r(S, TR_1) - r(S, TR_2)| \geq \beta$.

A classification rule is a map from the feature space $F$ to $\{1, 2\}$, the set of
class labels. A weak classifier $S$ gives a classification rule $\phi_S$ as follows.
If $r(S, TR_1) > r(S, TR_2)$, then $\phi_S(x) = 1$ for any $x \in S$ and
$\phi_S(x) = 2$ for any $x \in S^c$; if $r(S, TR_2) > r(S, TR_1)$, then
$\phi_S(x) = 2$ for any $x \in S$ and $\phi_S(x) = 1$ for any $x \in S^c$, where
$S^c$ denotes the complement of $S$ in $F$. For convenience, $\phi_S$ is also
called a weak classifier. The error rate on $TR$ of a weak classifier $S$ or
$\phi_S$ can be close to 0.5. Thus a weak classifier may be very weak in terms of
classification error.
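
The construction above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the text does not say how the numbers $l_i$ and $u_i$ are drawn, so uniform draws with rejection are assumed here; boxes are treated as closed and the thresholds as non-strict; and all function names are hypothetical.

```python
import random

def widened_bounding_box(points, lam):
    """Smallest axis-aligned box around `points`, widened by factor lam about its center."""
    box = []
    for i in range(len(points[0])):
        lo = min(x[i] for x in points)
        hi = max(x[i] for x in points)
        center, half = (lo + hi) / 2.0, lam * (hi - lo) / 2.0
        box.append((center - half, center + half))
    return box

def random_region(outer, q, rho, rng):
    """Random box inside `outer` that contains q and spans at least rho of each width."""
    region = []
    for (L, U), qi in zip(outer, q):
        while True:  # rejection sampling: an assumed scheme, not taken from the text
            l, u = rng.uniform(L, qi), rng.uniform(qi, U)
            if u - l >= rho * (U - L):
                break
        region.append((l, u))
    return region

def covers(regions, x):
    """True if x lies in the union of the (closed) boxes."""
    return any(all(l <= xi <= u for xi, (l, u) in zip(x, box)) for box in regions)

def coverage(regions, points):
    """r(S, T): fraction of the points of T that fall inside the union S."""
    return sum(covers(regions, x) for x in points) / len(points)

def make_weak_classifier(tr1, tr2, lam, rho, kappa, beta, rng, max_tries=10000):
    """Return (regions, label_inside) with |r(S, TR1) - r(S, TR2)| >= beta."""
    tr = tr1 + tr2
    outer = widened_bounding_box(tr, lam)
    for _ in range(max_tries):
        regions = [random_region(outer, rng.choice(tr), rho, rng) for _ in range(kappa)]
        r1, r2 = coverage(regions, tr1), coverage(regions, tr2)
        if abs(r1 - r2) >= beta:  # enrichment condition
            return regions, (1 if r1 > r2 else 2)
    raise RuntimeError("no enriched union found; try a smaller beta")

def classify(weak, x):
    """phi_S(x): the enriched class inside S, the other class outside S."""
    regions, label_inside = weak
    return label_inside if covers(regions, x) else 3 - label_inside
```

A union that fails the enrichment test is simply discarded and redrawn, so larger values of $\beta$ make each weak classifier more expensive to generate.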






In this paper, we use as weak classifiers the finite unions of rectangular re- 
gions defined previously. From the theoretical point of view, weak classifiers may 
be formed by using other geometric objects such as spheres and half-planes. One 
key in constructing weak classifiers is that each weak classifier should be topolog- 
ically “thick.” A “thicker” weak classifier contains more training feature vectors 
than a “thinner” one. In this paper, $\lambda$, $\rho$, and $\kappa$ are used to
determine the “thickness” of weak classifiers. In the theory of Kleinberg’s SD,
other important concepts related to weak classifiers include enrichment and
uniformity. The effect of enrichment is picked up by $\beta$. Uniformity is not
discussed in this paper since it seems that the empirical log ratio $l$ (see
Section 2.2) does not require this. For the discussion on various aspects of weak
classifiers, we refer the readers to [?], [?], and [?].



2.2 Combining Weak Classifiers 

Suppose that there is a sequence of weak classifiers $S_1, \ldots, S_t$, which are
produced independently. Correspondingly we may rewrite this sequence of weak
classifiers as $\phi_1, \ldots, \phi_t$, where $\phi_k = \phi_{S_k}$ for
$k = 1, \ldots, t$. It is seen that $\phi_1, \ldots, \phi_t$ are also diverse. Here
diversity is used to indicate the following: if $t$ is large, then for any feature
point $x$ the probability is very high that there exists a pair $\phi_i$ and
$\phi_j$ such that they make different errors in classifying $x$. To combine these
weak classifiers, one can use the empirical log ratio in [?]. For each weak
classifier $\phi$, denote by $e(\phi)$ the apparent error of $\phi$ on the training
data, and define a weight $w(\phi) = e^{-\eta e(\phi)}$, where $\eta$ is a positive
constant. For a new feature vector $x$, the value of the prediction function based
on $\phi_1, \ldots, \phi_t$ is the value of the empirical log ratio:

$$l(x) = \frac{1}{\eta} \ln \left( \frac{\sum_{\phi:\, \phi(x) = 1} w(\phi)}{\sum_{\phi:\, \phi(x) = 2} w(\phi)} \right)$$



Also following [?], we define the classification rule as follows. Let $\Delta$ be a
nonnegative constant. Given $x \in F$, classify $x$ into class 1 if
$l(x) > \Delta$, classify $x$ into class 2 if $l(x) < -\Delta$, and the status of
$x$ is “no prediction” if $-\Delta \leq l(x) \leq \Delta$. “No prediction” means
that the information is insufficient to make a classification.

Denote the above classification rule by AWC, short for Averaging Weak Classifiers.
Clearly a natural question is: what performance can one expect AWC to achieve?
From [?], we see that the generalization error of AWC is close to that of the best
$\phi$ in the sequence $\phi_1, \ldots, \phi_t$. [?] also points out that in some
cases AWC may significantly outperform the best $\phi$ among
$\phi_1, \ldots, \phi_t$. For example, let us consider the following scenario.
Imagine one has a sequence of classifiers $\phi_1, \ldots, \phi_t$, where the
$\phi$’s are not necessarily weak classifiers as defined in Section 2.1. Suppose
that from this sequence there is one $\phi^*$ such that $e(\phi^*) = 1/10$, and
that $e(\phi) = 1/5$ for each
$\phi \in \{\phi_1, \ldots, \phi_t\} \setminus \{\phi^*\}$. Also suppose that for
each $x$, the fraction of
$\phi \in \{\phi_1, \ldots, \phi_t\} \setminus \{\phi^*\}$ giving the right label
is approximately 3/4. Let $\eta = 1$. Then if $\phi^*(x) = 1$, which is the correct
label for $x$, we have



$$l(x) \approx \ln\left(\frac{e^{-1/10} + (3/4)(t-1)e^{-1/5}}{(1/4)(t-1)e^{-1/5}}\right) = \ln\left(3 + \frac{4e^{1/10}}{t-1}\right) \approx \ln 3$$

as $t$ is large. Similarly, if $\phi^*(x) = 2$, which is the correct label for $x$,
$l(x) \approx -\ln\left(3 + \frac{4e^{1/10}}{t-1}\right) \approx -\ln 3$ as $t$ is
large. If $\phi^*(x) = 1$, which is a wrong label for $x$,
$l(x) \approx \ln\left(\frac{1}{3} + \frac{4e^{1/10}}{3(t-1)}\right) \approx -\ln 3$
as $t$ is large. And if $\phi^*(x) = 2$, which is a wrong label for $x$,
$l(x) \approx -\ln\left(\frac{1}{3} + \frac{4e^{1/10}}{3(t-1)}\right) \approx \ln 3$
as $t$ is large. The above simply shows that if $t$ is large enough, AWC (for
$\eta = 1$) classifies all the feature points correctly. In this example, the
classifiers are averaged almost uniformly. [?] states that if in some cases there
are more $\phi$’s with low error rates, the balance between the two sets becomes
more delicate. This explains why we produce weak classifiers and then put them into
the empirical log ratio $l$. Suppose we have a sequence of weak classifiers
$\phi_1, \ldots, \phi_t$. Let $S_l$ denote a set of $\phi$’s with low apparent
error rates and $S_l^c$ denote the set of all the remaining $\phi$’s. Then our hope
is that as $t$ is large the balance between $S_l$ and $S_l^c$ will make AWC a
classification rule with a good generalization performance.
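
The limiting behaviour in the example above can be checked numerically. The following sketch evaluates the log ratio of the weight voting for the correct label to the weight voting for the wrong label, with one classifier of apparent error 1/10, $t - 1$ classifiers of apparent error 1/5 of which three quarters are right, and $\eta = 1$. The function name is hypothetical.

```python
import math

def example_log_ratio(t, star_correct):
    """ln(weight voting for the correct label / weight voting for the wrong label)."""
    w_star, w_other = math.exp(-1 / 10), math.exp(-1 / 5)  # weights e^{-e(phi)}, eta = 1
    right = 0.75 * (t - 1) * w_other
    wrong = 0.25 * (t - 1) * w_other
    if star_correct:
        right += w_star
    else:
        wrong += w_star
    return math.log(right / wrong)

for t in (2, 10, 100, 10**6):
    print(t, example_log_ratio(t, True), example_log_ratio(t, False))
```

A positive value means that the correct label wins. When the best classifier is wrong the value is negative for very small $t$ and approaches $\ln 3$ as $t$ grows, which is the sense in which the average can outperform its best member.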

To conclude this section, we summarize our binary classification algorithm 
as follows. 



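1. Given $\lambda$, $\rho$, $\kappa$, and $\beta$, generate $t$ weak classifiers
$S_1, \ldots, S_t$. Set $\phi_k = \phi_{S_k}$ for $k = 1, \ldots, t$.

2. Given $\eta$, calculate the weights $w(\phi_1) = e^{-\eta e(\phi_1)}, \ldots,
w(\phi_t) = e^{-\eta e(\phi_t)}$.

3. For any $x$, calculate the prediction value $l(x)$. Classify $x$ with given
$\Delta$.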
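
Steps 2 and 3 can be sketched as follows. The sketch assumes the weights $w(\phi) = e^{-\eta e(\phi)}$ and a log ratio scaled by $1/\eta$; that scaling is an assumption here and has no effect on the decision when $\Delta = 0$. The votes are the labels (1 or 2) that the weak classifiers assign to one point $x$, and the function names are hypothetical.

```python
import math

def weight(apparent_error, eta):
    """w(phi) = exp(-eta * e(phi))."""
    return math.exp(-eta * apparent_error)

def empirical_log_ratio(votes, errors, eta):
    """l(x) from the votes (1 or 2) of the classifiers and their apparent errors."""
    w1 = sum(weight(e, eta) for v, e in zip(votes, errors) if v == 1)
    w2 = sum(weight(e, eta) for v, e in zip(votes, errors) if v == 2)
    if w2 == 0.0:
        return math.inf       # every classifier votes for class 1
    if w1 == 0.0:
        return -math.inf      # every classifier votes for class 2
    return math.log(w1 / w2) / eta

def awc_classify(votes, errors, eta, delta=0.0):
    """Return 1, 2, or None for "no prediction"."""
    l = empirical_log_ratio(votes, errors, eta)
    if l > delta:
        return 1
    if l < -delta:
        return 2
    return None
```

With `delta=0.0` a prediction is withheld only in the case of an exact tie.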



3 Experimental Results 

In this section, we report the experimental results on classifying feature vectors
from several problems. In example 1 a simulated dataset was used. The datasets
in examples 2-4 were taken from the UCI Machine Learning Repository. The
choices of the parameters of our learning algorithm were made as follows. In
order to compare our results with those available in the literature, we set
$\Delta = 0$ in our algorithm so that “no prediction” actually did not occur and
all the test points were classified. Also before the experiments we fixed
$\eta = 10$.

Note that $\rho$, $\lambda$, and $\kappa$ together determine the “thickness” of
each weak classifier. For convenience we fixed $\rho = 0.3$ and tuned $\lambda$ and
$\kappa$. (Note that $\beta$ tells us that a weak classifier covers the two classes
of feature vectors in the training set differently. Therefore, to a certain degree,
$\beta$ contains the information about the accuracy of each weak classifier.)
Tuning $\lambda$, $\kappa$ and $\beta$ was done through 10-fold cross-validation
for examples 2-4 or the usual training/test procedure for the simulated data in
example 1. This tuning process consisted of two steps. In step 1, we conducted a
coarse tuning. We considered $\lambda \in [1.0, 2.0]$,
$\kappa \in \{2, \ldots, 30\}$, and $\beta \in [0.1, 0.9]$. We ran AWC ($t = 200$)
for each choice of $\lambda$, $\kappa$ and $\beta$ by looping over the ranges with
a step size of 0.1, 2, and 0.1 for $\lambda$, $\kappa$, and $\beta$, respectively.
For example 1, we selected the combination of $\lambda$, $\kappa$ and $\beta$
corresponding to the best performance on the test dataset. For examples 2-4, we
selected the combination of $\lambda$, $\kappa$, and $\beta$ corresponding to the
best averaged test performance. Denote the selected parameters by $\lambda_0$,
$\kappa_0$, and $\beta_0$. In step 2, we conducted a fine tuning. We considered new
ranges of $\lambda$, $\kappa$ and $\beta$ centered at $\lambda_0$, $\kappa_0$, and
$\beta_0$, respectively. The length of each range was set to be half of that in
step 1. The new range was truncated wherever it would extend beyond the range in
step 1. We ran AWC (again $t = 200$) for each choice of $\lambda$, $\kappa$ and
$\beta$ by looping over the ranges with a step size of 0.05, 1, and 0.05 for
$\lambda$, $\kappa$, and $\beta$ respectively. As in step 1, we chose the
combination of $\lambda$, $\kappa$, and $\beta$ corresponding to the best test
performance as the fine tuning result. Denote the selected parameters by
$\lambda_1$, $\kappa_1$, and $\beta_1$. These values were used with the actual runs
of the experiments.

With all the known parameters, we ran AWC to estimate the test error rates.
For each necessary run, $t = 1000$ weak classifiers were produced.
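
The two-step search described above can be sketched as grid construction only; running AWC on each grid point is left out. The ranges and step sizes are those quoted in the text, while the function names and the dictionary layout are hypothetical.

```python
def grid(lo, hi, step):
    """Evenly spaced values from lo to hi inclusive."""
    n = int(round((hi - lo) / step))
    return [round(lo + i * step, 10) for i in range(n + 1)]

def fine_range(center, lo, hi):
    """Interval of half the coarse length, centred at `center`, truncated to [lo, hi]."""
    half = (hi - lo) / 4.0
    return max(lo, center - half), min(hi, center + half)

# (low, high, coarse step) for each tuned parameter, as quoted in the text
COARSE = {"lam": (1.0, 2.0, 0.1), "kappa": (2, 30, 2), "beta": (0.1, 0.9, 0.1)}
FINE_STEP = {"lam": 0.05, "kappa": 1, "beta": 0.05}

def coarse_grids():
    return {name: grid(*spec) for name, spec in COARSE.items()}

def fine_grids(best):
    """`best` maps each parameter name to the value selected in step 1."""
    out = {}
    for name, (lo, hi, _) in COARSE.items():
        a, b = fine_range(best[name], lo, hi)
        out[name] = grid(a, b, FINE_STEP[name])
    return out
```

The coarse step then visits 11 x 15 x 9 = 1485 parameter combinations, each requiring a run with $t = 200$ weak classifiers.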

Example 1 ( Two normal populations with equal covariance matrix ) 

Consider two distributions $N(\mu_1, I)$ (class 1) and $N(\mu_2, I)$ (class 2),
where $I$ is the 2x2 identity matrix, $\mu_1$ is the vector $(1.5, 0)'$, and
$\mu_2$ is the vector $(0, 0)'$. Both of the prior class probabilities $\pi_1$ (for
class 1) and $\pi_2$ (for class 2) are equal to 1/2. The training set contains 400
points from each class, and the test set contains 1000 points from each class. The
averaged results over ten such independently drawn training/test set combinations
were used to estimate the error rates. Parameter values from the tuning process are
$\lambda = 1.05$, $\kappa = 5$, and $\beta = 0.40$. The averaged test error from
the “averaging” method described in this paper is 23.3%. As a comparison, the
linear discriminant rule yields a test error of 23.2%.
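
As a point of reference for these two figures, the Bayes error of this simulated problem can be computed in closed form. For two normal populations with a common identity covariance and equal priors, the optimal rule is linear and its error is $\Phi(-\delta/2)$, where $\delta$ is the distance between the two means. The short sketch below evaluates it; the variable names are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

mu1, mu2 = (1.5, 0.0), (0.0, 0.0)
delta = math.hypot(mu1[0] - mu2[0], mu1[1] - mu2[1])  # distance between the means
bayes_error = normal_cdf(-delta / 2.0)
print(round(100 * bayes_error, 1))  # prints 22.7
```

Both reported test errors therefore lie within about one percentage point of the best achievable error.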

Example 2 ( Breast cancer) 

The dataset was taken from the UCI Machine Learning Repository. The data
came from Dr. William H. Wolberg, University of Wisconsin Hospitals, Madison
([?]). The dataset contains 699 points in the nine-dimensional space
$\mathbb{R}^9$, coming from two classes: benign (458 cases) and malignant (241
cases). We used (from the tuning) $\lambda = 1.05$, $\kappa = 3$, and
$\beta = 0.85$. The test error based on 10-fold cross-validation is 3.8%.

Example 3 (Diabetes) 

The data were gathered among the Pima Indians by the National Institute of
Diabetes and Digestive and Kidney Diseases ([?]). The dataset contains 768
points in the eight-dimensional space $\mathbb{R}^8$. There are two classes: tested
positive (268 cases) and negative (500 cases). We used $\lambda = 1.00$,
$\kappa = 4$, and $\beta = 0.30$. The test error based on 10-fold cross-validation
is 25.2%.

Example 4 (Hepatitis) 

The data were from Gail Gong at Carnegie Mellon University. The dataset contains
155 points with dimension 19. There are two classes: die (32 cases) and
live (123 cases). We used $\lambda = 1.50$, $\kappa = 20$, and $\beta = 0.30$. The
test error based on 10-fold cross-validation is 17.3%.

Table 1 presents the results for our runs, as well as for other “weak learning
algorithms.” We used the same notations as in [?]. In the table, the results
of the first nine columns are from [?]. Specifically, FIA, FID, and C45 represent
three learning algorithms FindAttrTest, FindDecRule, and Quinlan’s C4.5,
respectively. Their boosted versions are denoted by ABO, DBO, and 5BO,
respectively. Their bagged versions are denoted by ABA, DBA, and 5BA,
respectively. The tenth column (SDK) reports the results of Stochastic
Discrimination from [?]. Our results from averaging weak classifiers are given in
the last column. From the table, we can see that the test errors of AWC obtained in
examples 2-4 are comparable to the best results.



Table 1. Test error rates. The first nine columns contain the results of FindAttrTest, 
FindDecRule, C4.5, and their boosted and bagged versions from Freund and Schapire, 
the 10th column (SDK) contains the results of Stochastic Discrimination from Klein- 
berg, and the last column reports the results of the method of averaging weak classifiers. 



dataset         FIA   ABO   ABA   FID   DBO   DBA   C45   5BO   5BA   SDK   AWC
breast cancer   8.4   4.4   6.7   8.1   4.1   5.3   5.0   3.3   3.2   2.6   3.8
diabetes       26.1  24.4  26.1  27.8  25.3  26.4  28.4  25.7  24.4  25.5  25.2
hepatitis      19.7  18.6  16.8  21.6  18.0  20.1  21.2  16.3  17.5  16.2  17.3



The above examples are merely used to demonstrate the effectiveness of the 
method presented in this paper. We have not tried to find the best setting of the 
parameters $\lambda$, $\rho$, $\kappa$, and $\beta$. Although the tuning process
proposed above (when $\rho$ is fixed) often works, it is in no way the best
procedure. As better methods
are found to pick up the parameters, the classification results will definitely 
be improved. AWC has also been applied to many other binary classification 
problems. Experiments show that classification results depend on the parameters 
and that carefully tuned parameters often lead to excellent results. 

4 Conclusion 

Combination of classifiers is a rich research area in pattern recognition. In this 
article, we combine an arbitrary number of weak classifiers through a weighted 
average. Experimental results show that the ensemble constructed this way is 
comparably accurate in classifying feature vectors and overfitting does not occur 
for the ensemble.



Abstract. One approach to dealing with real complex systems is to use two or
more techniques in order to combine their different strengths and overcome
each other’s weaknesses, generating hybrid solutions. In this project we
identified the need for an improved system for toxicology prediction and
developed an architecture able to satisfy it. The main tools we integrated are
rules and artificial neural networks (ANN). We defined the chemical structures of
fragments responsible for carcinogenicity according to human experts. We then
developed specialized rules to recognize these fragments in a given chemical and
to assess their toxicity. In practice the rule-based expert associates a category
with each fragment found, and then a category with the molecule. Furthermore, we
developed an ANN-based expert that takes molecular descriptors as input and
predicts carcinogenicity as a numerical value. Finally, we added a classifier
program to combine the results obtained from the two experts into a single
predictive class of carcinogenicity to man.



1 Introduction 

Predicting carcinogenicity is a challenging goal, given the social and economic
importance of the problem. Chemicals are responsible for many tumors, and industry
is required to take into account the carcinogenicity of the chemicals it uses and
produces. However, experimental tests on chemicals last for years, are costly, and
require the use of animals, with the consequent ethical problems. Considering the
importance of the goal, it is worthwhile to continue the attempts to improve
computerized systems that predict carcinogenicity. So far the most popular
programs have been expert systems [1] (ES). In many cases they look for the
presence of toxic residues in the molecule, as in [2]. More recently, neural
networks (ANN) [3] and inductive learning [4] have been used.

In the present study we tried a new approach, combining different systems in a
hybrid architecture. We developed an ES able to recognize toxic residues and
predict a class of toxicity. Furthermore, we used an ANN with molecular descriptors
as input to provide a different prediction of toxicity. Finally, we used a symbolic
rule induction program to merge the information from the two sources.



2 Definition of the Phenomenon to Be Modeled: Carcinogenicity 

Cancer is not a single disease. Furthermore, each single cancer involves a complex 
sequence of events. The complexity of the phenomenon means that experimental data 
are not precise, and in some cases contradictory. Most of the experiments are done on 
animals. Extrapolation of results from animals to humans is complicated also because 
in animal experiments high doses are used, while humans are generally exposed to 
low doses. 

Carcinogens are listed in classes by national and international agencies. The
International Agency for Research on Cancer (IARC) considers four classes: the
compounds which have been recognized as carcinogenic to man are in class 1, the
compounds which are not carcinogenic (only a few compounds) are in class 4, and the
other compounds are split into three classes of different degrees of uncertainty:
probably or possibly carcinogenic to man (classes 2A and 2B), and not classifiable
as to their carcinogenicity to humans (class 3, the most numerous one,
characterized by the highest uncertainty). This classification combines, in the
evaluation of carcinogenicity, the experimental evidence with the amount of
epidemiological knowledge available.

A different approach has been introduced by Gold and colleagues [5]. They
developed a numerical data set that contains standardized and reviewed results for
carcinogenicity for more than 1200 chemicals. The carcinogenicity data on rat and
mouse are expressed in terms of the parameter TD50, which is the chronic dose rate
that would give tumors to half of the animals within some standard experiment time.
The huge amount of data and the quantitative, homogeneous evaluation represent
two important advantages.

Both kinds of characterization have been used: categorical (as in the IARC
classes) and continuous (as in the Gold data set), the first with the residue
approach, the second with the ANN. We extended the applicability of the prediction
to man using a symbolic rule induction program. To do this, we used the IARC
classification for the training of this module.

In the area of toxicity prediction, QSAR (Quantitative Structure Activity
Relationships) and SAR (Structure Activity Relationships) models are common.
They are based on the evidence that the structure of a molecule is responsible for
its activity, and that biological data about the mechanisms are not a must to
predict the outcome. Generally SAR models for carcinogenicity are only able to
classify into two classes, positive or negative, while QSAR models give a real
value for the toxicity. QSAR methods are usually applied to assess drugs; the
challenge is to use them to predict toxicity values for large classes of chemicals
and for complex phenomena such as cancer.



3 Our Residue Approach 

Many toxicologists consider the presence of given fragments in the molecule as an
indication of potential carcinogenicity.
For instance, Ashby and Paton [6] listed many toxic residues responsible for
adverse activity. At the end of the eighties CompuDrug started from this approach
and encoded into its ES, called HazardExpert, the behaviour of selected residues
based on a report by the U.S. Environmental Protection Agency. To enhance the
efficiency of the system, there are built-in modules which predict the dissociation
constant (pKa) and the distribution coefficient (logP). These are used to predict
the bioabsorption and accumulation of xenobiotics in living organisms, in addition
to oncogenicity, mutagenicity, teratogenicity, irritation, sensitization,
immunotoxicity and neurotoxicity. HazardExpert examines the compound itself as well
as potential metabolites, which are generated by dedicated modules. We made an
accuracy test on HazardExpert in 1995 in the European project EST, and we found
ways to improve it [7].
Sanderson and Earnshaw [8] used if-then rules, introducing a series of
substructures known to be toxic into the rule base of a system called DEREK
(Deductive Estimation of Risk from Existing Knowledge), which then recognizes any
such residues in the compound examined. DEREK makes qualitative rather than
quantitative predictions. It looks for previously characterized toxicophores, which
are highlighted in the display together with their associated toxic activity. The
presence of several toxicophores in the molecule means there are more risks, but
whether the risks are additive or not is decided by the user. DEREK also takes into
account physicochemical properties such as logP and pKa. There are several
toxicological endpoints, including mutagenicity, carcinogenicity, skin
sensitization, irritation, reproductive effects, neurotoxicity and others.



3.1 Definitions of Rules of Ar-N Compounds 

We studied this topic for aromatic amines and related compounds; in particular, we
considered all the aromatic compounds with at least one nitrogen linked to the
aromatic ring (Ar-N compounds), a group that contains a large number of chemicals,
many of them carcinogens. The Ar-N group is divisible into 10 chemical classes,
further split into some subclasses, as shown in Table 1.

While classes are defined only by the presence of a chemical group
characterizing the Ar-N bond, subclasses are bounded by the following criteria:

1. presence of the same atom or substituent or chemical structure in a fixed
position relative to the Ar-N bond;

2. implementation convenience, in order to reduce memory needs and accelerate
the computer search;

3. toxicological affinity of chemicals in terms of TD50 values, target tissue and/or
IARC class.

The structure for implementing this knowledge is a two-level structure, as
illustrated in Fig. 1.

Table 1. Ar-N compounds divided into classes and subclasses.

1) PRIMARY AMINES
   a- Monocyclic aromatic primary amines
   b- Pentaatomic heteroaromatic primary amines
   c- Hexaatomic heteroaromatic primary amines
   d- Biphenyl primary amines
   e- Di- and triphenylmethane amines and analogues
   f- 4- and 4,4’-Stilbenes
   g- 2-aminofluorene and analogues
   h- Condensed polycyclic primary aromatic amines 1
   i- Condensed polycyclic primary aromatic amines 2

2) NITROCOMPOUNDS
   a- Monocyclic aromatic nitro compounds
   b- 2-nitro-5-furyl
   c- Thio- and azo-pentaatomic nitro compounds
   d- Condensed polycyclic nitro compounds 1
   e- Condensed polycyclic nitro compounds 2
   f- Miscellaneous nitro compounds

3) AZOCOMPOUNDS
   a- Dibenzo azo compounds
   b- 1-naphtho azo compounds
   c- 2-naphtho azo compounds

4) HYDRAZINES
   a- Hydrazines 1
   b- Hydrazines 2

5) SECONDARY AMINES
   a- Aromatic secondary aliphatic amines
   b- Diphenyl secondary amines
   c- Carbazole
   d- Sulfonic secondary amines
   e- Purines

6) AMIDES
   a- Monocyclic aromatic amides
   b- Biphenyl amides
   c- 2-acetylaminofluorene derivatives
   d- Pentaatomic heteroaromatic amides
   e- Hexaatomic heteroaromatic amides

7) TERTIARY AMINES
   a- Monocyclic aromatic tertiary amines
   b- Di- and triphenylmethane tertiary amines
   c- N,N-dihydroxyethyl tertiary amines
   d- Nitrogen mustards
   e- Pentaatomic heterocyclic tertiary amines

8) C-NITROSOCOMPOUNDS

9) N-NITROSOCOMPOUNDS

10) ISOCYANATES

For each subclass a first level structure, which identifies the chemical fragment
common to each residue belonging to the subclass, has been identified. The second
level structures specify each residue. Two corresponding inhibition levels have been
introduced for situations where the fragment found has no effect.

- First level: identifies the structure of the nitrogen fragment characterizing the
class and the aromatic structures bonded to that group.

- First inhibition level: handles compounds that, even if related to the structure
of the subclass, are not carcinogens or have been ascribed to another subclass.

- Second level: permits the identification of a specific compound or small groups
of compounds that refer to the same subclass but differ in some specific elements
bound to the nitrogen group and/or to the aromatic structure, and that are
suspected to be involved in the carcinogenicity process.

- Second inhibition level: excludes a specific compound or a small group of
compounds.



[Figure: panels “First Level Structure: 1-Naphtho azocompounds”, “First Level
Inhibition”, “Second Level Structure: Bensub-1NA residue”, “Second Level
Inhibition”]

Fig. 1. Example of structure levels. The figure shows the structures to search for
at the first and second levels and the relevant inhibitions.



Each fragment is associated with a category expressing the level of toxicity. Our
system reports the highest level obtained and the residue responsible; if more than
one toxic residue is present, the program selects the most active. We defined five
“carcinogenicity levels”, using three parameters:

- the TD50 of the molecules [5];

- the level of carcinogenicity ascribed to the fragment contained in the molecule
(averaging the evaluation for each fragment over all molecules containing the
substructure);

- the classification or the evidence of carcinogenicity given by the databases IARC,
IRIS, HSDB, NTP, RTECS.

The COSMIC format has been chosen to describe the molecules; it uses atom 
hybridization instead of information on atomic bonds, with two positive 
consequences: 

• All bonds are equal. The chemical information is hidden in the nodes and so
the search algorithm is easier.

• Hydrogens are left out. The molecular graph is smaller and so the search is 
faster. 

3.2 Internal Representation

Graph theory was used to represent the chemical structures. They are stored as
adjacency lists: given the node i, the nodes in list i are the atoms that are
adjacent to vertex i, as shown in Figure 2.
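
A minimal sketch of such a representation is given below, with aniline as the example molecule. Hydrogens are left out, as stated above, and each node carries an atom label with its hybridization. The label strings and the function name are hypothetical and are not the COSMIC format itself.

```python
def adjacency_lists(n_atoms, bonds):
    """Return {atom index: list of adjacent atom indices} for an undirected graph."""
    adj = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    return adj

# Aniline without hydrogens: six ring carbons (0..5) and one nitrogen (6).
# The hybridization labels are illustrative assumptions.
labels = ["C(sp2)"] * 6 + ["N(sp3)"]
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6)]
adj = adjacency_lists(len(labels), bonds)

for i in sorted(adj):
    print(i, labels[i], adj[i])
```

Because bond orders are not stored, a bond is just a pair of node indices, which is what makes all bonds equal from the point of view of the search.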

3.3 Search Method 

The search for a fragment in a molecule is a subgraph isomorphism problem. A graph
$g_a$ is isomorphic to a subgraph of a graph $g_b$ if and only if there is a
one-to-one correspondence between the node sets of this subgraph and those of $g_a$
that preserves adjacency. This problem is, in general, NP-complete [9]. The search
operation has been divided into two parts: the first search level is performed by
finding all possible isomorphisms between the structure considered and the
molecule, with Ullmann's algorithm [9], modified to manage hydrogens and wildcards.
After finding a first level structure, the second part of the search procedure
checks positive and negative conditions, using a backtracking technique. If a
second level structure is found and no inhibition is found, a residue is considered
found.
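
The first search level can be illustrated with a plain backtracking matcher. This is a simplified sketch and not Ullmann's algorithm: it omits the refinement step that prunes candidate assignments, it requires only that bonded fragment atoms map to bonded molecule atoms, and it handles wildcards but not hydrogens. The label "X" for a wildcard and all names are assumptions for illustration.

```python
def adjacency(n, bonds):
    """Adjacency sets for an undirected graph on nodes 0..n-1."""
    adj = [set() for _ in range(n)]
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def find_fragment(frag_labels, frag_bonds, mol_labels, mol_bonds, wildcard="X"):
    """Return every mapping {fragment atom: molecule atom} that preserves bonds."""
    frag_adj = adjacency(len(frag_labels), frag_bonds)
    mol_adj = adjacency(len(mol_labels), mol_bonds)
    matches = []

    def fits(f, m, mapping):
        if frag_labels[f] != wildcard and frag_labels[f] != mol_labels[m]:
            return False
        # every already-mapped neighbour of f must map to a neighbour of m
        return all(mapping[g] in mol_adj[m] for g in frag_adj[f] if g in mapping)

    def extend(mapping):
        f = len(mapping)                # fragment atoms are assigned in index order
        if f == len(frag_labels):
            matches.append(dict(mapping))
            return
        used = set(mapping.values())
        for m in range(len(mol_labels)):
            if m not in used and fits(f, m, mapping):
                mapping[f] = m
                extend(mapping)
                del mapping[f]          # backtrack

    extend({})
    return matches

# Aniline without hydrogens: ring carbons 0..5, nitrogen 6 bonded to carbon 0.
mol_labels = ["C"] * 6 + ["N"]
mol_bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6)]
print(find_fragment(["N", "C"], [(0, 1)], mol_labels, mol_bonds))
```

Inhibition levels could be layered on top by running the same matcher on the inhibiting structure and rejecting the residue when it also matches.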



4 The ANN-Based Prediction 

A backpropagation neural network [10] has been adopted in this study to implement
the quantitative prediction of carcinogenicity; more details are in [11].

From Gold’s database, 104 molecules presenting an aromatic ring and a nitrogen
linked to the aromatic ring have been chosen. We computed molecular descriptors of
six main groups (physico-chemical, geometric, topological, electrostatic,
quantum-chemical and thermodynamic); from the initial set of 34 descriptors a
selection was necessary in order to avoid an excessive training time for the
network. Principal component analysis (PCA) has been used for the selection,
building a final set of 13 descriptors (molecular weight, HOMO, LUMO, dipole
moment, polarizability, Balaban, ChiV3 and flexibility indices, logD at pH 2 and
pH 10, third principal axis of inertia, ellipsoidal volume, electrotopological sum).



[Figure: panels “Chemical Structure” and “Adjacency Lists”]

Fig. 2. The implementation through the adjacency lists.



The parameter TD50 reported by Gold et al. [5] has been adopted for the output. The
output has been derived from a transformation of the TD50 according to the following
formula:

output = Log (MW*1000/TD50) 

Data were scaled between 0 and 1. The output has also been scaled accordingly. The
scaling parameters were computed on the training set, and the validation set was
scaled with the parameters obtained from the training set.
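
The transformation and the scaling can be sketched as follows. The sketch assumes that Log denotes the base-10 logarithm and that the scaling is a min-max scaling whose bounds are taken from the training set only; both are assumptions, and the function names are hypothetical.

```python
import math

def activity(mw, td50):
    """output = Log(MW * 1000 / TD50), with Log assumed to be base 10."""
    return math.log10(mw * 1000.0 / td50)

def fit_scaler(train_values):
    """Min-max bounds computed on the training set only."""
    lo, hi = min(train_values), max(train_values)
    if hi == lo:
        raise ValueError("constant training values cannot be scaled")
    return lo, hi

def apply_scaler(value, scaler):
    """Scale with the training bounds; validation values may leave [0, 1]."""
    lo, hi = scaler
    return (value - lo) / (hi - lo)
```

A validation value that lies outside the range seen in training is mapped outside [0, 1], which is the expected consequence of reusing the training bounds.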

All simulations were performed using MBP v 1.1 [10], initializing the weights with
the SCAWI technique, and using the acceleration factors. Each network has been
trained starting from 100 random points in the space, in order to minimize the
probability of converging towards local minima.

For validation, N/2-fold cross-validation has been used. The MSE and $R^2$
resulting from 10000 iterations of the back-propagation ANN, using different
numbers of internal neurons, showed best results with four or seven hidden neurons:
$R^2$ was 0.691 in both cases.

The presence of outliers in the set was suspected and investigated; 12 molecules
were identified as outliers and removed. Results after outlier removal showed a
clear improvement in the $R^2$, which became 0.824 (with 4 hidden neurons). The
majority (9 out of 12) of the outliers are molecules for which the experimental
results were not statistically significant and an arbitrary 10^' value was given by
Gold.



5 Combining the Two Predictions into the Hybrid System

The results we obtained from the two parts of the prediction should now be combined.
In the present study we added a third module dedicated to the classification. Given
the output of the residue search and the expected TD50, we wanted to extrapolate a
combined prediction of the human carcinogenicity.

We split some classes of the IARC classification according to the following criteria:

- to define 5 classes, 1 to 5, from lower to higher risks, based on the TD50 values; 

- to check the presence of each residue in the molecules under study; 

- to give to each residue a toxicity class obtained as the mean of the toxicity of the 
molecules where it was found; 

- to assign to the molecule the maximum toxicity obtained from the residues and 
ANN module. 

We built classification trees from examples, using different tree construction 
programs: 

• C4.5 [12], which maximizes the entropy gain and builds hyper-rectangles in the
attribute space;

• CART, which builds binary trees [13];

• OC1 [14], which uses a random perturbation of parameters to escape from local
minima.

The training set has been prepared with all molecules and two attributes each: the
TD50 (predicted by the backpropagation neural network) and the carcinogenicity
category predicted by the residue module. Performance under leave-one-out
validation is reported in Table 2.
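
The leave-one-out procedure itself is independent of the tree inducer. The sketch below shows the loop with a nearest-neighbour rule standing in for the tree programs, so it does not reproduce C4.5, CART, or OC1. The toy samples (a scaled activity and a residue category per molecule) are invented for illustration only, and all names are hypothetical.

```python
def nearest_neighbour_label(train, x):
    """Stand-in classifier: label of the closest training sample."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(train, key=lambda sample: sq_dist(sample[0], x))[1]

def leave_one_out_accuracy(samples, predict):
    """Hold out each sample in turn, train on the rest, count correct predictions."""
    hits = 0
    for i, (x, y) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]
        hits += predict(train, x) == y
    return hits / len(samples)

# Invented toy data: (scaled predicted activity, residue category) -> class
toy = [((0.0, 1), "low"), ((0.1, 1), "low"), ((1.0, 5), "high"), ((0.9, 5), "high")]
print(leave_one_out_accuracy(toy, nearest_neighbour_label))
```

Any learner with the same `predict(train, x)` signature can be substituted for the stand-in.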

Table 2. Results obtained with tree induction systems (accuracy %)

             C4.5   CART   OC1
Training     93.3   88.5   90.2
Validation   81.9   85.5   82.8

6 Discussion and Conclusions 

The results in Table 2, expressed as accuracy (the ratio between the number of
correct assignments and the total number of compounds), show promising
possibilities. The integration of the two approaches improved the performance of
the individual methods.

Very few comparisons have been made of different ES in toxicology. Most of the 
papers presented by the authors of the different ES claim good predictions, often 
better than 90%. Omenn [15] reported the results of predictions on 44 chemicals 
made by some human experts and different computer programs. Table 3 compares 
the results with the ES and the best human results. This indicates that for the time 
being no ES can do better than a good human expert. 

A particular problem is the nature and evaluation of the information. In several 
cases experts pay special attention to some data and overlook others, because they 
know from experience which data are most reliable. Sometimes their experience is
concentrated on certain aspects of the problem. As a consequence, different experts
will give different answers.

Table 3. ES and human expert predictions for toxicology of 44 chemicals

Expert          Accuracy   Percentage
Human experts   30/40      75
DEREK           22/37      59
TOPKAT          14/24      58
COMPACT         19/35      54
CASE            17/35      49


For this reason to assess the reliability of predictive models we must rely on internal 
evaluation, mainly on leave-one-out. The best we can expect is to be able to correctly 
predict a new external set, as we will try in the future. 

In an attempt to overcome the limitations of attribute-based learning, some programs
learn first-order predicate logic. Given background knowledge (expressed as
predicates, positive examples, and negative examples) the ILP system is able to
construct a predicate logic formula H such that all the positive examples can be
logically derived from the background formulas and H, and no negative example can
be logically derived.

The main work in ILP is the predictive toxicology challenge, which aims at
constructing a SAR model based on data from the NTP (National Toxicology Program).
The NTP produced the PTE data sets, based on the study of about 300 compounds to
be used as a training set, together with small test sets (30 compounds). For all
molecules the carcinogenicity is available, expressed as yes/no. The data set presents
a mix of chemicals (both organic and inorganic, as representative of 19 million);
some chemical classes are not represented, and some known biological mechanisms are
not represented.

A report of the submissions to the challenge is given in [16]. Among the models, presented
by 9 laboratories, the best estimated accuracy is 0.87 for a stochastic system, on the
outcome of 23 of the 30 molecules. The other models range from 0.78 to 0.48. The
approach based on ILP reached 0.78.

Our research confirms the feasibility of an ANN for carcinogenicity for several 
chemical classes, which exhibit their activity according to different mechanisms. A 
valuable characteristic of our ANN is that it seems to correctly predict carcinogenic 
compounds; unfortunately, it is less accurate in the prediction of non-active 
compounds. It is likely that ANN alone cannot solve all the problems linked with 
carcinogenicity prediction. A classical example is the case of ortho- and
para-anisidine, which have very similar descriptor values, but one of the compounds is
carcinogenic while the other is not. In this case an approach based on the residues can
distinguish the two chemicals. 

An advantage of our architecture, which evolved from previous work on phytotoxicity
[17], is that the output is not simply a classification into two classes of activity,
carcinogenic or not, as in several programs predicting toxicity. Our system gives a
quantitative prediction of the activity, and also a classification similar to IARC.
Moreover, we do not need biological data to predict carcinogenicity. A key advantage
of programs based on the simple chemical structure is that they do not require the
synthesis of the chemical to be tested, or biological experiments, in order to make
a prediction. However, in the hybrid architecture we defined it is easy to introduce
other inputs into the rule induction program, such as results from mutagenicity tests.

Acknowledgements. The European Union contracts COMET and IMAGETOX. 

Abstract. Multiple classifier systems fall into two types: classifier
combination systems and classifier choice systems. The former aggregate 
component systems to produce an overall classification, while the 
latter choose between component systems to decide which classification 
rule to use. We illustrate each type applied in a real context where 
practical constraints limit the type of base classifier which can be used. 
In particular, our context - that of credit scoring - favours the use 
of simple interpretable, especially linear, forms. Simple measures of 
classification performance are just one way of measuring the suitability 
of classification rules in this context. 

Keywords: logistic regression, perceptron, support vector machines, 
product models 



1 Introduction 

This paper argues that there are two kinds of multiple classifier system, which we 
shall call classifier combination systems and classifier choice systems. Classifier
combination systems combine the predictions from multiple classifiers to yield 
a single class prediction. In contrast, classifier choice systems select a single 
classifier from a set of potential candidate classifiers, choosing one which is best 
suited to classify the target point, where ‘best’ is defined in some appropriate 
sense. We shall further argue that, although classifier combination systems are 
the most common kind of ‘multiple classifier system’, in a deeper sense it is 
inappropriate to regard them as multiple classifier systems, and that the name 
might better be reserved for classifier choice systems. We illustrate these ideas 
with systems we have developed in the context of a particular practical domain 
— that of credit scoring in retail banking. 

The illustrations presented in the paper are motivated by the following ob- 
servation: much work on classifier design goes too far. In particular, a colossal 
effort has been expended by the research community on developing classification 
rules which have as small an (out of sample) error rate as possible. This is
inappropriate for several reasons. Firstly, as argued in Hand, error rate is seldom
of real interest in practical problems. Secondly, almost all of the assessment of 
classification rules is based on the assumption that future points to be classified 
are obtained from the same distribution as the design set. The very choice of 
the word ‘future’ here suggests that often this is a false assumption: populations
change and evolve over time. Thus the distribution from which future points
are drawn is unlikely to be exactly the same as that from which the design set
was drawn - and the difference is likely to increase over time. (For a discussion
and illustration of this see Kelly et al. [5]). The implication of this is that it is
pointless refining the performance of a classification rule beyond a certain point: 
by the time it is used in practice, the problem may have changed so that other 
sources of uncertainty and variation far outweigh the gain in accuracy resulting 
from refining the rule. Thirdly, in many real problems (not all) there is intrinsic 
arbitrariness in the definitions of the classes. For example, Kelly et al. and
Kelly and Hand describe situations when the classes are defined in terms of
thresholds on an underlying continuum - and where the threshold is a human 
choice. The arbitrariness of the definitions, and the possibility that one might 
want to change them, means that refining the classification rule to match one 
particular definition very well may be wasted effort. More generally than these 
reasons, there is also the point that a simple focus on error rate (or any other 
specific measure of performance for that matter) is one-sided: there are many 
other aspects to a good rule, including features such as interpretability (the ma- 
chine learning community has generally put more emphasis on this than has the 
statistics community - perhaps leading to classification rules such as rule-based 
methods which are more attractive in real applications), speed, and the readiness
with which the ‘reasoning’ underlying the classification may be explained
to non-experts (in human commercial applications there are sometimes legal
requirements for this, but more generally such facility expedites good customer 
relations). There are also special circumstances affecting applications of classi- 
fication rules. Indeed it is possible that, when looked at closely, every situation 
is rather different from every other. An example of such a special circumstance 
arises in building rules to predict the classes objects will fall into in the future 
if one never discovers the true class of those one assigns to class 1 (say). This 
arises, for example, in any situation where the classifier is used to decide which 
cases to investigate (e.g. tax investigations, medical investigation, etc.) 



Motivated by these points, we describe some approaches to multiple classifier 
systems below which deliberately constrain the model to take simple forms. This 
does not mean that the models are not complicated - they are multiple classifier 
systems, after all - but just that the final decision is based on a simple form. We 
illustrate such ideas for both classifier combination systems and classifier choice 
systems. 



The body of this paper is divided into two main sections. The first deals with 
classifier combination systems, and the second with classifier choice systems. In 
each case we give an informal background description of the basic ideas, and 
then illustrate how some of our work fits into each framework. 

2 Classifier Combination Systems

2.1 Background 

The simplest approach to combining classifiers is to take a maximum vote of the 
predicted classes from each of several distinct and separately estimated classi- 
fiers. There are two distinct possibilities here. Firstly, the classifiers could each 
be of a different kind - for example, one could be a tree classifier, one a neural
network, one a nearest neighbour classifier, and so on. The hope would be that
the combined classifier could steal the strength of the strongest individual one
- but note that the overall classifier is a combination of the individual results.
This is in contrast to systems described in Section 3, which partition the space
of predictor variables into regions such that the ‘best’ individual classifier is used
in each region.

The second possibility is that the classifiers would all be of the same form: 
for example, they might all be tree classifiers, or simple regression classifiers 
applied to different subsets of the data - as in bagging (Breiman). With
this approach, the way in which the individual classifiers are generated yields 
a ‘weighting’ across the classifier space - including many classifiers similar to 
a particular one is equivalent to weighting that one heavily. This observation 
leads us onto the first extension of the basic voting method: instead of simply 
counting the predictions, evaluate a weighted sum (an ‘average’ over the weight 
distribution) of predicted classes from each of several distinct and separately 
estimated classifiers. We conjecture that statisticians tend to think in terms of 
weighted combinations of constituent classifiers, while computer scientists tend 
to think of voting systems. 

The key question here, of course, is that of what weights to use. This has been 
investigated by many researchers. Empirical approaches based on performance of 
the individual classifiers are one possibility. It is also possible to cast this problem 
into a formal Bayesian framework - and this has proved a popular approach 
to combining tree classifiers (where it is sometimes seen as an alternative to 
pruning). See, for example, Buntine and Oliver and Hand. Indeed, there are
stronger links to Bayesian statistics here, via work on combining the opinions of
experts (all we have to do is regard the individual classification rules as ‘experts’)
- see, for example, Genest and McConway.

Despite all that, the phenomenon known as the flat maximum effect (von
Winterfeldt and Edwards; Hand) suggests that often the use of equal
weights will be almost as effective as ‘optimal’ weights. Indeed, Kittler and
Duin reported precisely this: ‘A surprising outcome of the comparative study
was that the combination rule developed under the most restrictive assumptions
- the sum rule - and its derivatives consistently outperformed other classifier
combinations schemes. To explain this empirical finding, we investigated the
sensitivity of the various schemes to estimation errors. The sensitivity analysis has
shown that the sum rule is most resilient to estimation errors.’

Much of the above also applies to another extension of the weighted class 
predictions from different rules. This is to average the predicted ‘probabilities’ 
rather than the predicted classes. Suppose that the $j$th classification rule
estimates the probability of belonging to class 1 (say two classes for simplicity),
$p(1|x)$, by $p_j(1|x)$. Then its error is $e_j(x) = p_j(1|x) - p(1|x)$. It follows
that the expected squared error of a weighted sum with weights $w_j$ is

$$E\Big[\Big(\sum_j w_j e_j(x)\Big)^2\Big].$$

With a little standard algebra this can be rewritten as $w'Cw + (w'E(e))^2$, where
$C$ is the covariance matrix of the errors. If $E(e) = 0$ then (imposing the
constraint $w'w = 1$), the set of weights which minimise this are the components of
the eigenvector of $C$ corresponding to the smallest eigenvalue. These ideas can
be applied more generally, and do not require the $e$ to be defined in terms of
differences between probabilities and their estimates. They could, for example,
be in terms of error rates.
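
As a rough illustration of the eigenvector result, the sketch below estimates the error covariance matrix from simulated errors and takes the eigenvector belonging to the smallest eigenvalue as the weight vector. The simulated errors, their noise scales and the function name `min_variance_weights` are assumptions made purely for illustration; they are not taken from the work described here.

```python
import numpy as np

def min_variance_weights(errors):
    """Return unit-norm weights w minimising w'Cw, with C the error covariance.

    errors: array of shape (n_points, n_classifiers), assumed to have zero mean.
    """
    C = np.cov(errors, rowvar=False)
    # eigh returns eigenvalues in ascending order for a symmetric matrix
    eigvals, eigvecs = np.linalg.eigh(C)
    return eigvecs[:, 0], eigvals[0], C

rng = np.random.default_rng(0)
# Hypothetical errors of three classifiers that share a common noise component
shared = rng.normal(scale=0.05, size=(500, 1))
errors = shared + rng.normal(scale=[0.02, 0.05, 0.10], size=(500, 3))

w, smallest, C = min_variance_weights(errors)
equal = np.ones(3) / np.sqrt(3.0)  # equal weights under the same unit-norm constraint
print("eigenvector weights:", w, "variance:", w @ C @ w)
print("equal weights variance:", equal @ C @ equal)
```

Note that the constraint here is $w'w = 1$, so the resulting weights need not be positive or sum to one.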

A more elaborate extension is to relax the requirement that the component 
classifiers are estimated separately. That is, given that we know that the classi- 
fiers are to be combined, perhaps we could optimise the parameters of the indi- 
viduals in the context of using the others as well. This is similar to the distinction 
between using a combination of simple linear regressions and using a multiple 
regression. An example of this strategy is given in Mertens and Hand, and
we give a further example below. Of course, estimation of such models is much 
more laborious than when the individual parameters are estimated separately. 
The usual approach is an iterative one cycling between the different components 
(if this is arranged in such a way that the criterion decreases monotonically and
is bounded below, then convergence of the criterion is guaranteed by the monotone
convergence theorem).

Up until this point we have assumed that the individual ‘base’ classifiers are 
combined through a linear combination. Relaxing this leads to our next exten- 
sion, in which the outputs of each classifier (estimated probabilities of belonging 
to each class, or a predicted class, for example) are used as input to a higher order 
classifier, where this need not be a simple linear combination classifier, but could 
be arbitrarily complex. This is the principle underlying Wolpert’s idea of stacked 
generalisation (Wolpert [21]). Of course, as soon as we do this, we see (another
generalisation) that it is unnecessarily constraining to require the components 
to be classifiers. They are really simply feature extractors. This means that they 
need not be chosen to predict probabilities of class membership but could be more 
elaborate feature extractors. Perhaps, if estimated separately, then it might be 
an effective strategy to let them each be classifiers, but if estimated simultane- 
ously this might not be so good. This extension is why we think it is misleading 
to think of such systems as ‘multiple classifier systems’: the components need 
not be classifiers at all. At this point, of course, we have described what essen- 
tially takes place in neural networks, projection pursuit regression, generalised 
additive models, boosting (which is really a type of generalised additive model 
- see Friedman et al. [3]) and other highly flexible modern classification tools.

2.2 Product Classifier

In Section 1 we remarked that often criteria other than mere classification
performance were crucial. The work described here arose in a context in which a
strong historical preference had been placed on simple linear classification rules
based on categorised variables - ‘front end’ credit scoring in the retail banking
sector (Hand and Henley). This emphasis is a consequence of the ready
interpretability and the ease of explanation of such rules to those with limited
statistical expertise. One consequence is a large base of software which uses such 
rules in this sector, along with a very large base of user expertise and familiarity 
with such rules. We should note parenthetically here that front end credit scor- 
ing describes situations in which one might be called upon to justify the decision 
to a lay person - hence the desire for simplicity. In contrast, in back end scoring 
(fraud detection is an illustration) systems of arbitrary complexity can be used. 

Models of the kind we are concerned with may be written as

$$\sum_{i=1}^{p} a_i \qquad (1)$$

where the summation is over the variables, and where $a_i$ takes a value which
depends on the category of the $i$th variable into which the case to be classified
falls.

The earliest models of this kind are naive Bayes models (Hand and Yu), in
which the parameters are obtained by combining values estimated separately
for each of the classes. As shown in Hand and Adams and in Hand, however,
more powerful models result if the parameters of the logit transformation of
the class probabilities are estimated directly - a logistic regression on dummy
variables defined by the categorisation. Because of the nonlinear transformation
implicit in the categorisation of the variables prior to fitting the model, the
decision surfaces in such a model can be quite complicated, and are certainly
not constrained to be linear. On the other hand, they remain simple. While
an attractive feature of such models, this simplicity does raise the question of
whether they could be made even more powerful without too much sacrifice of
the notion of the combination of contributions from each variable. We explored
this in the classifier combination system described below.

Since models of the form in (1) are so widespread in the retail credit industry,
we wanted to retain this form as our base classifier. We also wanted our classifier
combination system to be simple enough to explain to bank employees who
may not have a numerate background. This meant that, for example, a model
averaging approach was not acceptable.

We eventually settled on models which use structures of the form (1) as
factors in a product form. Thus a simple two component model will take the
form

$$\Big(\sum_{i=1}^{p} a_i\Big)\Big(\sum_{i=1}^{p} b_i\Big) \qquad (2)$$
Given the parameters of the model (the $a_i$ and $b_i$) the predictions from the model
are obtained by using the standard bank software to obtain the components of 
(2), and these are simply multiplied together. Both of these steps are thus simple 
and, more importantly, the overall process is only slightly more complicated than 
those used at present (based on a single linear combination). 
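
The two steps can be sketched as follows: each component is an additive scorecard over categorised variables, and the two component scores are multiplied. The variable names, categories and weights below are hypothetical and chosen only to make the arithmetic visible; they are not the parameters estimated in this work.

```python
# Hypothetical scorecards: one table of category weights per variable.
SCORECARD_A = {
    "age_band": {"18-25": 0.2, "26-40": 0.5, "41+": 0.7},
    "residence": {"owner": 0.6, "tenant": 0.3},
}
SCORECARD_B = {
    "age_band": {"18-25": 0.9, "26-40": 1.0, "41+": 1.1},
    "residence": {"owner": 1.2, "tenant": 0.8},
}

def additive_score(scorecard, applicant):
    """Sum, over variables, the weight of the category the case falls into."""
    return sum(weights[applicant[var]] for var, weights in scorecard.items())

def product_score(applicant):
    """Multiply the two additive component scores together."""
    return additive_score(SCORECARD_A, applicant) * additive_score(SCORECARD_B, applicant)

applicant = {"age_band": "26-40", "residence": "owner"}
print(product_score(applicant))
```

Each component can still be read as an ordinary scorecard, which is the property the product form is meant to preserve.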

The parameters of this model could be estimated separately, as described 
in Section 2.1 and as is the case with most classifier combination systems.
However, the real gains of this system are likely to manifest themselves when 
the parameters of each component are estimated in the context of each other 
component - when the model is treated as a whole. We have developed a simple 
iterative parameter estimation procedure (Hand and Kelly). We used this
model to make predictions about whether applicants for current accounts would
turn out to be good or bad risks (defined in a formal operational way, which 
cannot be publicly stated for commercial reasons). That is, we have a two class 
problem, with the true class being discovered by following the customers over 
time. One of the distinguishing features of problems of this kind in the retail 
credit context is that the proportion of bad customers is often low (the classes 
are ‘unbalanced’): in this case 3.76%. 

Another feature, and one which is crucial and which illustrates the points
about assessing classification rules made in the opening section, is that one does
not observe the outcome for all cases. The classifier is essentially being used as
a screening instrument, and one only follows up and observes those applicants
one accepts (the predicted ‘goods’). This means that criteria such as error rate
simply cannot be calculated - apart from any doubts one might have about the
issue of regarding the two kinds of misclassification as equally serious. For such
problems, we have argued (Hand) that the most appropriate criterion is
bad rate amongst accepts, and that this is best displayed as a plot of bad rate
amongst accepts against the accept rate, so that the choice of operating point
can be made on the basis of as much information as possible.
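
A minimal sketch of this criterion is given below. It assumes that higher scores indicate better risks and, for illustration only, that the outcome is known for every scored case; the toy scores and outcomes are invented.

```python
def bad_rate_among_accepts(scores, is_bad, accept_rate):
    """Accept the top-scoring fraction and return the proportion of bads among them.

    scores: higher means the applicant looks like a better risk (assumption).
    is_bad: 0/1 outcomes, assumed available for every case in this toy example.
    """
    n_accept = max(1, round(accept_rate * len(scores)))
    ranked = sorted(zip(scores, is_bad), key=lambda pair: pair[0], reverse=True)
    accepted = ranked[:n_accept]
    return sum(bad for _, bad in accepted) / n_accept

def bad_rate_curve(scores, is_bad, accept_rates):
    """Points (accept rate, bad rate amongst accepts) for plotting."""
    return [(r, bad_rate_among_accepts(scores, is_bad, r)) for r in accept_rates]

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
is_bad = [0, 0, 0, 1, 0, 0, 0, 1, 0, 1]
curve = bad_rate_curve(scores, is_bad, [0.5, 0.8, 1.0])
print(curve)
```

Plotting such curves for two competing rules on the same axes lets the operating point be chosen after seeing the whole trade-off rather than before.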

A further complication arises from the fact that the classes in problems of this 
kind are often poorly separated. This means that there is often little difference 
between classifiers, and that what difference there is can easily be swamped by 
random variation, especially if (relatively) small data sets are being used for the 
estimation. On the other hand, a little difference matters in a context where 
a small reduction in number of ‘bads’ accepted can translate into millions of 
pounds. In the present illustration, we were provided with a sample of 10,000 
observations. We split this randomly into a design set and test set 200 times, 
estimating both a basic logistic model and simple product model of form (2), 
and averaged the results. 

The results of using the logistic model (upper curve throughout the figure) and
the two factor product model are shown in Figure 1. The differences here are
of sufficient size to be significant in the banking context. For example, with an
accept rate of 90% (typical for this application), the product classifier has a bad
rate amongst accepts 0.26 percentage points below that of the logistic model. This
translates into an 8.87% relative reduction in the bad rate among the accepts compared
with the logistic model (an improvement which has been described as ‘massive’).




Fig. 1. Logistic model and product classifier on current account data 



3 Classifier Choice Systems 

3.1 Background 

Classifier choice systems are systems of classifiers in which only one component 
of the system is chosen to actually make the classification. Obviously, a different 
component may be chosen for each point (or else it would reduce to choosing 
just one classifier from a set of classifiers during the system’s construction). Once 
again, one might choose the component classifiers to have different forms or one 
could choose them to have the same form (but with different parameters). We 
discuss the latter approach in more detail in Section 3.2. 

Multiple classifier systems of this form may be thought of as partitioning the 
measurement space during training, so that a particular (‘the best’) classifier 
relates to each cell of the partition. The nature of the partitioning will depend 
on the nature of the component classifiers: linear classifiers will lead to piecewise 
linear surfaces separating the regions. In general, a local performance estimate 
could be made, possibly, but not necessarily, at each design set point, during 
construction. 

A different approach to classifier choice systems has been suggested by
Scott and by Provost and Fawcett. This arose from the observation that
different classifiers are best at different operating points. Figure 2 shows a schematic
diagram of the ROC curves for two classifiers (see Hand for a discussion of
these). As the operating point of a problem changes, so the performance of a
classifier is given by the corresponding point on its ROC curve. However, when a
system of classifiers is used, the best possible performance is given by points on
the convex hull of the ROC curves. Thus, in the figure, given the two classifiers
illustrated there, for operating points between A and B, neither classifier alone
can achieve performance on the line connecting A and B, but taken together such
performance can be achieved. In particular, for (1-specificity) values between A
and B, best performance is obtained by choosing between the rules at random
- yielding an overall classifier with performance which lies on the straight line
connecting A and B. That is, a randomly chosen proportion a of points are
classified by one classifier and a proportion 1 - a by the other, with 0 < a < 1.
This is a ‘random choice’ classifier system.
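
The interpolation can be sketched as below. The operating points A and B are hypothetical (1-specificity, sensitivity) pairs, and the function names are illustrative only.

```python
import random

def mixing_proportion(point_a, point_b, target_fpr):
    """Proportion of cases sent to classifier A so the mixture hits target_fpr.

    Each point is (1 - specificity, sensitivity), with point_a to the left of point_b.
    """
    fpr_a, fpr_b = point_a[0], point_b[0]
    if not fpr_a <= target_fpr <= fpr_b:
        raise ValueError("target must lie between the two operating points")
    return (fpr_b - target_fpr) / (fpr_b - fpr_a)

def mixture_performance(point_a, point_b, a):
    """Expected operating point when a fraction a of cases goes to A and 1 - a to B."""
    return tuple(a * pa + (1 - a) * pb for pa, pb in zip(point_a, point_b))

def random_choice_classify(x, classify_a, classify_b, a, rng=random):
    """Pick one of the two rules at random for this case."""
    return classify_a(x) if rng.random() < a else classify_b(x)

A = (0.1, 0.6)  # hypothetical operating point of the first classifier
B = (0.3, 0.9)  # hypothetical operating point of the second classifier
a = mixing_proportion(A, B, target_fpr=0.2)
perf = mixture_performance(A, B, a)
print(a, perf)
```

The expected performance is linear in a, which is why the mixture traces out the straight line between the two operating points.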




Fig. 2. A random choice classifier system 

3.2 Cost Specific Classifiers

We are again concerned with situations in which we need, for external reasons, 
to adopt a classifier of simple form, and we will assume (again we are motivated 
by the credit scoring environment) that we will use a linear classifier. In general, 
this simple form will mean that the model is misspecified - it would be a rare 
situation in which the true probability of belonging to class 0, say, did follow 
a logistic model exactly. This implies that there are places, in predictor space, 
where the model does not match the underlying conditional distribution p(0|x). 
In particular, there are places (presumably most places) where the contours of 
the model will not lie along the true contours (which may be curved, unlike those 
of a logistic model, or may have different directions in different parts of the x 
space, even if they are linear). 

The majority of comparative studies of classification rules use error rate as the 
measure of performance. This assumes that the costs of the two (we restrict this 
discussion to two classes) types of misclassification are equally serious. From this 
it follows that the optimum rule is based on comparing an estimated probability 
with the contour p(0|x) = 1/2. This contour is the crucial one for classification 
when the costs are assumed equal. If, on the other hand, different costs are used, 
then different contours become relevant - as noted in Adams and Hand, if
the cost of misclassifying a point from class $i$ is $c_i$, then the relevant contour is
$p(0|x) = c_1/(c_0 + c_1)$. Given that the contours are unlikely to be parallel, and are
thus unlikely both to be estimated effectively by the single misspecified logistic 
regression model, performance with at least one of these sets of costs will be 
poor. Of course, this applies more generally, and, in general, different models 
will be best suited to different costs. 
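
The link between costs and the relevant contour can be made concrete with a small sketch. The function names are illustrative; the rule itself follows from comparing the expected cost $c_1 (1 - p(0|x))$ of assigning to class 0 with the expected cost $c_0\, p(0|x)$ of assigning to class 1.

```python
def cost_threshold(c0, c1):
    """Contour level p(0|x) = c1 / (c0 + c1), where c_i is the cost of
    misclassifying a point from class i."""
    return c1 / (c0 + c1)

def classify(p0, c0, c1):
    """Assign class 0 when the estimated p(0|x) exceeds the cost-determined level."""
    return 0 if p0 > cost_threshold(c0, c1) else 1

print(cost_threshold(1, 1))  # equal costs give the 1/2 contour
print(cost_threshold(1, 4))  # misclassifying class 1 is four times as costly
```

With equal costs the level is 1/2; making errors on class 1 four times as costly moves the relevant contour to 0.8, so a point with estimated probability 0.75 changes class.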

A single model (such as the single classical logistic model) is obtained by 
aggregating over the entire x space, but we are really concerned with a more 
local interpretation. All we really need to know, to classify a new object with 
measurement vector x, is on what side of the relevant contour it lies. In par- 
ticular, we do not need accurate estimates for p(0|x) for the new point. When 
we aggregate, we are distorting the contour of interest by those which we are 
not interested in. We are ending up with an average of the (differently oriented 
or shaped) contours, and it is unlikely that the particular one of interest will 
coincide with this average. Note that this problem is overcome in methods such 
as the perceptron estimation algorithm, and, more recently, support vector ma- 
chines, because they are more fundamentally local in nature. They concentrate 
on the decision surface (the contour of interest) and its vicinity, and are not led 
astray by the averaging with other contours. 

We have experimented with local logistic models which iteratively reweight
the design set points, favouring those near to the current estimate of the relevant
contour. This has a flavour similar to that of boosting, except that we choose
a single logistic model, best suited to the costs concerned, rather than taking a
weighted average of all the models constructed. Adams and Hand have
described how difficult it is to choose a set of costs accurately. For this reason,
rather than producing a single model, we find a set of models, covering the range
of cost ratios. That is, we present a set of models from which one is chosen -
hence the description of this approach as a classifier choice system.

To illustrate the ideas, Figure 3 shows simulated data in which 1000 points
are uniformly distributed over the unit square, and where the contours of the 
probability of belonging to class 0 are straight lines radiating symmetrically 
outwards from the origin (at the bottom left of the figure). That is, each of the 
1000 points was assigned to class 0 (denoted by a cross) or class 1 (denoted by a
circle) with a probability determined by these contours. Logistic regression, with 
true decision surface at p(0|x) = 0.5, leads to the (correct) estimated decision 
surface given by the continuous diagonal line in the figure. (In fact the contour 
shown differs slightly from the true one, which stretches from (0, 0) to (1, 1), 
because of sampling variation.) All the other estimated contours arising from this 
model are parallel to this diagonal line, but shifted up or down. In particular, the 
estimated contour corresponding to p(0|x) = 0.8 lies in the upper triangle and 
thus crosses the true contour, which has a steeper slope. That is, all contours 
except the 0.5 one have the wrong slope. This inaccuracy has arisen because the 
estimate of the 0.8 contour has the same orientation as the others, an average 
over all of them, despite the fact that the true contours have different slopes. 




Fig. 3. Simulated data illustrating cost specific classification rules 



To construct a cost specific classifier for the 0.8 contour, we estimated the
probability of belonging to class 0 using nearest neighbour methods and
discarded all points with estimates less than 0.7 or greater than 0.9. We then built
a logistic regression model using only the remaining points. This led to the 0.8
contour shown as a broken line in Figure 3. This is much closer to the true
orientation for the 0.8 contour.
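
A rough sketch of this construction follows. Several details are assumptions made for illustration and are not taken from the original experiment: the probability of class 0 is taken to rise linearly with the angle from the horizontal axis, the nearest neighbour estimate uses k = 50, and the logistic models are fitted by plain gradient ascent rather than a standard package.

```python
import numpy as np

def simulate(n=1000, seed=0):
    """Uniform points on the unit square; p(0|x) rises linearly with the angle
    from the horizontal axis, so contours are rays from the origin (assumption)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, 2))
    p0 = np.arctan2(X[:, 1], X[:, 0]) / (np.pi / 2)
    in_class0 = (rng.uniform(size=n) < p0).astype(float)
    return X, in_class0

def knn_prob(X, in_class0, k=50):
    """Nearest neighbour estimate of p(0|x) at every design point."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return in_class0[nearest].mean(axis=1)

def fit_logistic(X, y, steps=3000, lr=0.5):
    """Plain gradient ascent on the logistic log-likelihood."""
    A = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(A.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(A @ beta)))
        beta += lr * (A.T @ (y - p)) / len(y)
    return beta

def contour_slope(beta):
    """Slope dy/dx shared by every contour of a fitted linear logistic model."""
    return -beta[1] / beta[2]

X, in_class0 = simulate()
p_hat = knn_prob(X, in_class0)
band = (p_hat >= 0.7) & (p_hat <= 0.9)  # keep points near the 0.8 contour

beta_global = fit_logistic(X, in_class0)
beta_local = fit_logistic(X[band], in_class0[band])
print("global slope:", contour_slope(beta_global))
print("local slope:", contour_slope(beta_local))
print("true 0.8 slope under this simulation:", np.tan(0.8 * np.pi / 2))
```

The band limits 0.7 and 0.9 are those quoted above; how wide the band should be in general is a tuning choice that this sketch does not address.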

These ideas are not restricted to models such as logistic regression, which is 
obviously a global model, but apply more generally to any model which estimates 
parameters according to global rather than local goodness of fit or predictive abil- 
ity. For example, the standard k-nearest neighbour classifier is based on a single 
choice of k, obtained by cross-validated error rate or some more sophisticated 
method. But this single choice of k is an ‘optimal’ value aggregated over the 
whole space, and ignores the fact that different values may be best in different 
regions - and that the region which matters for the problem at hand will depend 
on the costs. This is true even for sophisticated Bayesian methods of averaging
over k (e.g. Holmes and Adams [23]).

4 Discussion 

Multiple classifier systems have attracted much interest because of their potential 
for combining component classifiers to yield overall performance better than that 
of any component. However, practical problems often have other features and 
restrictions, beyond issues of simple classifier performance, and the adaptability 
and flexibility of multiple classifier systems means that they can be effective 
at meeting these requirements. We have illustrated this by describing two multiple
classifier systems based on simple linear classifiers - a form which arose from the 
practical constraints of the domain we are studying. We have found it convenient 
to divide multiple classifier systems into two types, those which aggregate the 
components and those which choose between them, and have illustrated one of 
each kind. 

Abstract. It is known that the Error Correcting Output Code (ECOC)
technique can improve generalisation for problems involving more than 
two classes. ECOC uses a strategy based on calculating distance to a 
class label in order to classify a pattern. However in some applications 
other kinds of information such as individual class probabilities can be 
useful. Least Squares (LS) is an alternative combination strategy to the
standard distance based measure used in ECOC, but the effect of code 
specifications like the size of code or distance between labels has not 
been investigated in LS-ECOC framework. In this paper we consider 
constraints on choice of code matrix and express the relationship between 
final variance and local variance. Experiments on artificial and real data 
demonstrate that classification performance with LS can be comparable 
to the original distance based approach. 



1 Introduction 



Use of Error Correcting Output Codes (ECOC) for decomposing a multi-class 
problem into a set of complementary two class problems is a well established 
method in many applications pil2l4l5ltil7l8l9ll()lllll2lldll5llbll7ll8| . When first 
suggested ECOC was based on the idea of using error-correcting codes as class 
labels, so that individual classification errors propagated from a set of binary 
classifiers can potentially be corrected. For a two-class problem, classification
errors can be one of two types: either predicted class $\Omega_1$ for target class
$\Omega_2$, or predicted class $\Omega_2$ for target class $\Omega_1$.

In the ECOC method, a $k \times b$ binary code word matrix $Z$ has one row (code
word) for each of $k$ classes, with each column defining one of $b$ sub-problems
that use a different labelling. Specifically, for the $j$th sub-problem, a training
pattern with target class $\omega_i$ ($i = 1 \ldots k$) is re-labelled as class
$\Omega_1$ if $Z_{ij} = x$ and as class $\Omega_2$ if $Z_{ij} = \bar{x}$ (where $x$
is a binary variable, typically zero or one). One way of looking at the
re-labelling is to consider the $k$ classes as being arranged into two
super-classes. The original ECOC combining strategy uses a simple distance
measure (L1 norm), calculated with respect to the real-valued classifier outputs,
to determine the closest code word, and assigns a test pattern accordingly. If the
code word matrix satisfies suitable constraints, this strategy is identical to the
Bayesian decision rule. A problem with imposing constraints on code words is that
the generation process becomes very complex, but fortunately these constraints
are approximated by random codes provided $b$ is large enough.
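
The distance-based combining rule described above can be sketched in a few lines. This is only an illustration: the 4 x 5 code matrix and the expert outputs below are made-up values, and the function name `ecoc_l1_decode` is hypothetical rather than taken from the paper.

```python
def ecoc_l1_decode(Z, outputs):
    """Return the index of the code word (row of Z) closest to the
    real-valued expert outputs under the L1 norm."""
    distances = [sum(abs(z - y) for z, y in zip(row, outputs)) for row in Z]
    return min(range(len(Z)), key=distances.__getitem__)


# Hypothetical code matrix: k = 4 classes (rows), b = 5 sub-problems (columns).
Z = [
    [0, 0, 1, 1, 0],
    [1, 0, 0, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 0, 0],
]

# Hypothetical real-valued outputs of the b binary experts for one test pattern.
outputs = [0.9, 0.2, 0.1, 0.7, 0.6]

predicted_class = ecoc_l1_decode(Z, outputs)
```

Here the second code word is the nearest one, so the pattern is assigned to the second class even though no single expert output is exactly 0 or 1.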



Despite improvements in generalisation for many problems that have been
reported for ECOC, there is some discussion as to why it works well. A long
random code appears to perform as well as or better than a code designed for its
error-correcting properties. Attempts have been made to develop a theory for
ensemble classifiers in terms of bias/variance and margin, but so far these
ideas have not provided a convincing explanation for ECOC. A practical
approach to determining the source of the effectiveness of ECOC is to look at
variants of the ECOC strategy to see how they perform. This is also useful if we
want to extend ECOC to deal with applications for which it would be desirable to
understand ECOC features as estimation measures.

In this paper we look at an alternative ECOC combination strategy based on
Least Squares (LS-ECOC), which has been investigated previously and extended by
incorporating ridge regression when b is small. Recovering individual class
probabilities from super-class probabilities is easily accomplished by matrix
inversion when the individual probability estimates are exact and the columns of
the ECOC matrix are arranged in a "one-per-class" structure. In practice,
estimates are not perfect, and a natural choice for attempting to recover
probabilities is Least Squares. However, the effect of the code on the performance
of LS-ECOC has not been investigated in the way that it has for ECOC.

In Sect. 2 we determine, for Least Squares combining, the required form of 
the ECOC matrix such that errors in super-class probabilities (local experts) and 
individual class probabilities are jointly minimised. In Sect. 3 we find the
relationship between the final variance and the variance of the experts' errors as
a function of the number of columns b and the distance between rows of the ECOC
matrix for equi-distance codes. Experimental results in Sect. 4 demonstrate the
effect of code selection on classification performance in comparison with the
original distance-based approach.



2 ECOC and LS-ECOC

Decomposition of a multi-class classification problem into binary sub-problems in
ECOC can be interpreted as a transformation between spaces from the original
output q to p, given in matrix form by

$$p = Z^T q \qquad (1)$$

Given an estimate $\hat{p}_j$ of the posterior probability of the super-classes
(provided by the jth expert), this matrix equation can be solved to find an
estimate of the class membership probabilities q. However, $Z^T$ is not a square
matrix in general, and so does not have an inverse. Furthermore, base classifiers
will not produce correct probabilities, and the error can be represented by

$$\epsilon_j = \hat{p}_j - p_j, \qquad j = 1, \ldots, b \qquad (2)$$

A natural unbiased solution to equation (1) is based on using the method of
least squares, which means finding q which minimises a cost function such as

$$R_p = \sum_{j=1}^{b} \epsilon_j^2 = \sum_{j=1}^{b} (\hat{p}_j - p_j)^2 = \sum_{j=1}^{b} \Big( \hat{p}_j - \sum_{i=1}^{k} q_i Z_{ij} \Big)^2 \qquad (3)$$

The optimum point is given by

$$\hat{q} = (Z Z^T)^{-1} Z \hat{p} \qquad (4)$$

For the solution of equation (4) to exist, $ZZ^T$ must be non-singular. If all
elements of the $i$th row of Z are zero ($Z_{il} = 0$ for all $l$), or if two rows
are equal, $ZZ^T$ is singular. These conditions, like two equal columns, are also
not meaningful for the decomposition, so when the code is generated we make sure
that they do not occur. In summary, given a precise estimate of p (Bayesian binary
experts), we will find q precisely, but in the presence of noise the sensitivity of
the solution to the code matrix Z could be important.
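
As a small worked illustration of the least squares recovery, the sketch below solves the normal equations $(ZZ^T) q = Z p$ with exact rational arithmetic. The 3 x 4 code matrix and the class probabilities are invented for the example, and the helper names `solve` and `ls_ecoc` are hypothetical; with exact super-class probabilities the class probabilities should be recovered exactly.

```python
from fractions import Fraction


def solve(A, y):
    """Solve the square system A x = y by Gauss-Jordan elimination."""
    n = len(A)
    M = [[Fraction(a) for a in row] + [Fraction(v)] for row, v in zip(A, y)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if M[pivot][col] == 0:
            raise ValueError("Z Z^T is singular for this code matrix")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def ls_ecoc(Z, p_hat):
    """Least squares estimate of class probabilities: (Z Z^T)^{-1} Z p_hat."""
    k, b = len(Z), len(Z[0])
    gram = [[sum(Z[i][l] * Z[j][l] for l in range(b)) for j in range(k)]
            for i in range(k)]
    rhs = [sum(Z[i][l] * p_hat[l] for l in range(b)) for i in range(k)]
    return solve(gram, rhs)


# Hypothetical code matrix (k = 3 classes, b = 4 sub-problems).
Z = [[1, 1, 0, 0],
     [0, 1, 1, 0],
     [0, 0, 1, 1]]
q_true = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]

# Exact super-class probabilities p = Z^T q.
p = [sum(Z[i][l] * q_true[i] for i in range(3)) for l in range(4)]
q_recovered = ls_ecoc(Z, p)
```

With noisy estimates in place of `p`, the same call returns the least squares fit, which need not sum to one or stay inside [0, 1].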



3 Error and Code Selection 



Any Z satisfying equation (4) will minimise $R_p$, but we may like to find a Z that
will also minimise the sum square error of q ($R_q$). Now from (3)

$$R_p = \hat{p}^T \hat{p} - 2 \hat{p}^T p + p^T p \qquad (5)$$

and using equation (1)

$$R_p = \hat{q}^T Z Z^T \hat{q} - 2 \hat{q}^T Z Z^T q + q^T Z Z^T q \qquad (6)$$

If we let $ZZ^T = mI$, where I is the identity matrix and m a positive integer,

$$R_p = m (\hat{q}^T \hat{q} - 2 \hat{q}^T q + q^T q) = m R_q$$

However this corresponds to the one-per-class case since it implies that Z = I,
which means Z has no error-correcting capability.

Consider the case that $ZZ^T$ can be written in the form

$$ZZ^T = \begin{bmatrix} n & m & \cdots & m \\ m & n & \cdots & m \\ \vdots & \vdots & \ddots & \vdots \\ m & m & \cdots & n \end{bmatrix} \qquad (7)$$

Using the fact that $ZZ^T q$ is a vector whose elements can be written in the form
$(n - m) q_i + m$ (because the elements of q sum to one), and taking the elements
of $\hat{q}$ to sum to one as well, equation (6) can be written as

$$R_p = (n - m) \hat{q}^T \hat{q} + m - 2 (n - m) \hat{q}^T q - 2m + (n - m) q^T q + m$$

so that

$$R_p = (n - m) R_q \qquad (8)$$

Therefore from equation (8), if $ZZ^T$ is in the form given by equation (7), both
$R_q$ and $R_p$ can be minimised simultaneously.
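
Relation (8) can be checked numerically on a small example. The sketch below is an illustration under stated assumptions: it uses five code words of the 7-bit simplex code as a stand-in equi-distant code (not necessarily one of the codes used in the experiments), and two made-up probability vectors that each sum to one.

```python
from fractions import Fraction as F

# Five code words of the 7-bit simplex code: every row has n = 4 ones and
# every pair of rows shares m = 2 ones (an assumed example code).
Z = [[bin(a & c).count("1") % 2 for c in range(1, 8)] for a in range(1, 6)]


def project(Z, q):
    """Super-class probabilities p = Z^T q."""
    return [sum(Z[i][l] * q[i] for i in range(len(Z)))
            for l in range(len(Z[0]))]


def squared_error(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Hypothetical true and estimated class probabilities, both summing to one.
q = [F(1, 2), F(1, 5), F(1, 10), F(1, 10), F(1, 10)]
q_hat = [F(2, 5), F(1, 5), F(1, 5), F(1, 10), F(1, 10)]

n = sum(Z[0])                                # ones per row
m = sum(a * b for a, b in zip(Z[0], Z[1]))   # common ones between two rows

R_q = squared_error(q_hat, q)
R_p = squared_error(project(Z, q_hat), project(Z, q))
```

For these values `R_p` equals `(n - m) * R_q`. If `q_hat` did not sum to one, an extra term proportional to m would appear, which is why the sum-to-one condition matters in the derivation.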
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3.1 Equi-Distance Code 

Furthermore, consider the situation that Z is an equi-distance code, so that
$\sum_{l=1}^{b} |Z_{il} - Z_{jl}| = 2d$ for any pair i, j. Since the Hamming
distance between pair i, j is the sum of the number of ones in row i and row j
minus twice the number of common ones between i and j, we may write

$$\sum_{l=1}^{b} Z_{il} + \sum_{l=1}^{b} Z_{jl} - 2 \sum_{l=1}^{b} Z_{il} Z_{jl} = 2d \qquad (9)$$

Similar equations can be written for pair i, k and pair j, k

$$\sum_{l=1}^{b} Z_{il} + \sum_{l=1}^{b} Z_{kl} - 2 \sum_{l=1}^{b} Z_{il} Z_{kl} = 2d \qquad (10)$$

$$\sum_{l=1}^{b} Z_{jl} + \sum_{l=1}^{b} Z_{kl} - 2 \sum_{l=1}^{b} Z_{jl} Z_{kl} = 2d \qquad (11)$$

From equations (9), (10), (11), after re-arranging,

$$\sum_{l=1}^{b} Z_{il} Z_{jl} = \sum_{l=1}^{b} Z_{kl} Z_{jl} = \sum_{l=1}^{b} Z_{kl} Z_{il} = m \qquad (12)$$

where m is the number of common bits in the code words, and

$$\sum_{l=1}^{b} Z_{il} = \sum_{l=1}^{b} Z_{kl} = \sum_{l=1}^{b} Z_{jl} = n \qquad (13)$$

where n is the number of ones in each row.

Therefore if Z is an equi-distance matrix, the number of ones in different
rows is the same, and the number of common ones between any pair of rows
is equal. But a matrix Z of the form satisfying (7) will have the property of
equations (12) and (13) and will minimise both $R_p$ and $R_q$ simultaneously, since

$$[ZZ^T]_{ij} = \sum_{l=1}^{b} Z_{il} Z^T_{lj} = \sum_{l=1}^{b} Z_{il} Z_{jl} = \begin{cases} n & \text{if } i = j \\ m & \text{otherwise} \end{cases}$$

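3.2 Variance and Bias

For $ZZ^T$ of the form (7) the inverse is given by

$$C = (ZZ^T)^{-1} = \begin{bmatrix} c_1 & c_2 & \cdots & c_2 \\ c_2 & c_1 & \cdots & c_2 \\ \vdots & \vdots & \ddots & \vdots \\ c_2 & c_2 & \cdots & c_1 \end{bmatrix} \qquad (14)$$

where $c_1$ and $c_2$ can be expressed in terms of m, n, k

$$c_1 = \frac{n + (k - 2) m}{n^2 + (k - 2) m n - (k - 1) m^2} \qquad (15)$$

$$c_2 = \frac{-m}{n^2 + (k - 2) m n - (k - 1) m^2} \qquad (16)$$

From equation (4)

$$\hat{q} = \begin{bmatrix} c_1 & c_2 & \cdots & c_2 \\ c_2 & c_1 & \cdots & c_2 \\ \vdots & \vdots & \ddots & \vdots \\ c_2 & c_2 & \cdots & c_1 \end{bmatrix} Z \begin{bmatrix} \hat{p}_1 \\ \hat{p}_2 \\ \vdots \\ \hat{p}_b \end{bmatrix}$$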



We assume that individual classifiers have the same variance of error $\sigma_p^2$,
and that the covariance of the experts' errors between any pair is simply
$\rho \sigma_p^2$. Then, using equation (4), the final variance can be written

$$\sigma_q^2 = (c_1 - (k - 1) c_2)^2 \, n \, \sigma_p^2 \, (1 + (n - 1) \rho) \qquad (17)$$

From equations (15) and (16), and knowing that d = n - m (equation (9))
and that for any row of an equi-distance code b = m + n, equation (17) can be
written as



(6 - d)^(l + {b - d - l)p) 

((1 - 2fc)d2 + kbdy 



(18) 



Equation (18) tells us that the final variance increases with correlation among
experts. Although (18) is not a simple formula, with some simplification we can
understand how d and b affect $\sigma_q$. If we consider the case of $\rho = 0$ for simplicity



{knd — {k— l)rf 2)2 P 



so that $\sigma_q$ increases with n, for fixed d. Also $\sigma_q$ is reduced if d is
increased for fixed n. In other words, if we use longer words so that b is
increased, then m should also be increased to keep n fixed.

To determine the effect of bias, suppose that local experts provide $\hat{p} + \delta$,
where $\delta$ is the bias. From equation (3), $R_p$ with bias is given by

$$R_p = \sum_{j=1}^{b} (\hat{p}_j + \delta - p_j)^2 = \sum_{j=1}^{b} \big( (\hat{p}_j - p_j)^2 + \delta^2 + 2 \hat{p}_j \delta - 2 p_j \delta \big)$$

In most applications $\delta^2 \approx 0$, and if $\hat{p}$ is an acceptable estimate,
$\delta (\hat{p} - p)$ is small. Therefore $R_p$ is not sensitive to bias.

4 Experimental Results

4.1 Artificial Data

We test our ideas on an artificial benchmark in which we can find the result of
the Bayesian classifier as a reference and visualise the decision boundaries to
show the behaviour of ECOC. This helps in understanding how well the composite
system mimics the Bayesian classifier.

Consider five groups of two-dimensional random vectors having normal
distribution
$p(x|c_i) = \frac{1}{2 \pi \sigma_i^2} \exp\left( -\frac{\| x - \mu_i \|^2}{2 \sigma_i^2} \right)$
for $i = 1, 2, \ldots, 5$, with parameters given in Table 1.



Table 1. Distribution parameters of data used in artificial benchmark

class                  c1      c2      c3      c4      c5
mu_i (mean)            [0,0]   [3,0]   [0,5]   [7,0]   [0,9]
sigma_i^2 (variance)   1       4       9       25      64



Having a set of patterns consisting of an equal number of patterns from each
group, our goal is to classify them. Our base classifiers are not made by training;
instead, using the parameters from Table 1, we just find the posterior probability
of class (or super-class) membership for each sample. Using an equal number of
patterns from each group for the test set (equal prior probability for classes),
the Bayesian decision rule says: assign x to $c_i$ if
$P(c_i|x) = \max_i P(c_i|x)$. $P(c_i|x)$ is the posterior probability of class
membership for class $c_i$, and can be found by the Bayesian formula
$P(c_i|x) = \frac{p(x|c_i) P(c_i)}{p(x)}$, in which $P(c_i)$ is the prior
probability of class i and $p(x)$ is the same for all classes. So the decision rule
can be changed: assign x to $c_i$ if $p(x|c_i) = \max_i p(x|c_i)$.
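
The reference classifier for this benchmark can be sketched as follows. The means and variances are those of Table 1; the isotropic two-dimensional Gaussian density, the equal priors and all function names are assumptions made for the illustration rather than details confirmed by the text.

```python
import math

# Parameters from Table 1 (class means and variances).
MEANS = [(0, 0), (3, 0), (0, 5), (7, 0), (0, 9)]
VARIANCES = [1, 4, 9, 25, 64]


def likelihood(x, mean, var):
    """Isotropic 2-D Gaussian density p(x | c_i); the isotropic form is assumed."""
    d2 = (x[0] - mean[0]) ** 2 + (x[1] - mean[1]) ** 2
    return math.exp(-d2 / (2 * var)) / (2 * math.pi * var)


def class_posteriors(x):
    """Posterior P(c_i | x) under equal priors."""
    like = [likelihood(x, mu, v) for mu, v in zip(MEANS, VARIANCES)]
    total = sum(like)
    return [value / total for value in like]


def bayes_class(x):
    """Index of the class with the largest posterior (equivalently, likelihood)."""
    post = class_posteriors(x)
    return max(range(len(post)), key=post.__getitem__)


def superclass_posteriors(Z, q):
    """Posterior of the first super-class for each column j: sum_i Z[i][j] * q[i]."""
    return [sum(Z[i][j] * q[i] for i in range(len(Z))) for j in range(len(Z[0]))]
```

For example, `bayes_class((0, 5))` selects the third class, and `superclass_posteriors(Z, class_posteriors(x))` gives the noise-free outputs that ideal binary experts would produce for a code matrix `Z`.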

To simulate the behaviour of the system, Gaussian and uniform random data
are added to the output of experts. For a fair comparison between different
methods, the noise for each code matrix is produced once and used in all combining
methods. To find a code with the desired properties, we have used the BCH method,
followed by selecting rows using properties (12) and (13). Columns with all zeros
or ones have been removed, as explained in Sect. 3.

The following code matrices are used in this experiment (k = 5):

C1: a k x k unitary code (one per class)
C2: a k x 7 matrix with randomly chosen binary elements
C3: a k x 7 BCH code (minimum distance of 3, non-equal)
C4: a k x 7 BCH code with equal distance of 4
C5: a k x 15 matrix with randomly chosen elements
C6: a k x 15 BCH code with equal distance of 8
C7: a k x 31 BCH code with equal distance of 16

Adding Gaussian noise with variance of 0.5 and zero bias, the classification
rate of the Bayesian classifier is 71.82%, and with zero variance and 0.5 bias
it is 72.08%. The rates of matching (representing how closely the Bayes rate is
approximated) for original ECOC and LS-ECOC are presented in Table 2.

Table 2. Matching rate (% Bayesian) for ECOC and LS-ECOC with added noise

Code   Exp 1    Exp 2    Exp 3   Exp 4   Exp 5    Exp 6
C1     100.00   100.00   57.66   57.74   100.00   100.00
C2      97.56   100.00   65.74   59.98    53.88    86.88
C3      97.30   100.00   66.40   60.30    83.06    81.52
C4     100.00   100.00   69.04   69.04    98.64    98.64
C5      97.08   100.00   78.72   76.02    87.50    88.30
C6     100.00   100.00   82.78   82.78   100.00   100.00
C7     100.00   100.00   89.50   89.50   100.00   100.00

Exp 1: ECOC with no noise
Exp 2: LS-ECOC with no noise
Exp 3: ECOC with Gaussian noise (Bias=0, Variance=.5)
Exp 4: LS-ECOC with Gaussian noise (Bias=0, Variance=.5)
Exp 5: ECOC with Gaussian noise (Bias=.5, Variance=0)
Exp 6: LS-ECOC with Gaussian noise (Bias=.5, Variance=0)



From Table 2:

1. Without noise, the performance of LS-ECOC for codes with unequal distance
between rows is better than ECOC (Exp 1 and 2).

2. In noisy data

a) For equi-distance codes (C1, C4, C6, C7) original ECOC and LS-ECOC
have similar performance (Exp. 3,4 and 5,6).

b) For codes with unequal distance, ECOC is better in variance reduction 
(Exp. 3 and 4) while LS-ECOC has better performance for added bias 
(Exp. 5 and 6). It seems reasonable that for added bias the distance 
measurement in ECOC is adversely affected since the number of ones 
in code word labels is different. On the other hand, LS-ECOC is less 
sensitive to bias as predicted in Sect. 3. 

3. For longer random codes (C5), it can be expected that on average the number 
of ones in rows is similar and therefore there will be less difference in the 
ability of ECOC and LS-ECOC in handling bias and variance (Exp. 3,4 and 
5,6). 



4.2 Real Data 

We tested codes C1-C6 on real-data problems. The base classifier is
an MLP trained by backpropagation with fixed learning rate, momentum and
number of training epochs. The number of hidden nodes of the MLP, the number of
training and test patterns and the number of classes for the problems are shown in
Table 3. We also compared ECOC and LS-ECOC with Centroid-ECOC, which
is identical to ECOC except that distance is calculated to the centroid of classes
rather than to the code word label. The mean and standard deviation of
classification rates for ten independent runs are given in Tables 4 and 5.
From Tables 4 and 5:

1. The combining strategy (ECOC, Cent-ECOC, LS-ECOC) appears to have
little impact, except for codes C2, C3 with LS.

2. In all datasets for 7-bit code, equi-distant is best (C2,C3,C4).

3. Longer codes perform better. However for the 15-bit code, random is better
for two datasets, while equidistant is better for the other two.



Table 3. Specification of problems, showing number of classes, number of train and
test patterns, and number of MLP hidden nodes

Database    Class (Num)   Train (Num)   Test (Num)   Nodes (Num)
zoo         7             50            51           1
car         4             50            1678         1
vehicle     4             350           496          5
satellite   6             1000          5435         2



Table 4. Mean and Std classification rate for ECOC, LS-ECOC and Centroid-ECOC
on zoo and car data base.

code        ECOC(zoo)  Cent(zoo)  Lsqu(zoo)  ECOC(car)  Cent(car)  Lsqu(car)
C1  mean    89.54      89.54      89.54      72.15      72.15      72.15
    std      5.99       5.99       5.99       4.83       4.83       4.83
C2  mean    77.78      77.78      43.14      73.60      73.60      72.63
    std     13.91      13.91       6.79       0.93       0.93       1.21
C3  mean    88.89      88.89      84.97      72.96      72.96      71.99
    std      6.30       6.30       2.26       3.62       3.62       2.79
C4  mean    86.27      86.27      86.27      74.16      74.16      74.16
    std      8.98       8.98       8.98       3.70       3.70       3.70
C5  mean    94.77      94.77      94.12      74.33      74.33      74.55
    std      2.99       2.99       3.39       2.35       2.35       2.39
C6  mean    93.46      93.46      93.46      72.79      72.79      72.79
    std      2.99       2.99       2.99       2.65       2.65       2.65



156 



R. Ghaderi and T. Windeatt 



Table 5. Mean and Std classification rate for ECOC, LS-ECOC and Centroid-ECOC
on vehicle and satellite data base.

code        ECOC(veh)  Cent(veh)  Lsqu(veh)  ECOC(sat)  Cent(sat)  Lsqu(sat)
C1  mean    62.77      62.77      62.77      65.05      65.05      65.05
    std      8.65       8.65       8.65      17.29      17.29      17.29
C2  mean    66.94      66.94      61.22      80.29      80.29      23.91
    std      4.42       4.42       5.54       6.915      6.915      2.30
C3  mean    53.02      53.02      57.12      70.06      70.06      62.67
    std     14.90      14.90      15.33      10.42      10.42       6.96
C4  mean    69.15      69.15      69.15      69.48      69.48      69.48
    std      5.62       5.62       5.62       3.88       3.88       3.88
C5  mean    73.32      73.32      73.72      77.74      77.74      77.74
    std      2.78       2.78       3.82       4.31       4.31       4.98
C6  mean    75.34      75.34      75.34      80.43      80.43      80.43
    std      1.81       1.81       1.81       1.73       1.73       1.73



5 Discussion and Conclusion 

We have demonstrated theoretically and practically that LS-ECOC used with
equi-distant code words may give better performance, at least for shorter codes.
However, as the length of the code word was increased, no performance advantage
was apparent when comparing ECOC with LS-ECOC. Results on real data confirmed
that any theoretical advantage of LS-ECOC is not necessarily realised in practice
if longer codes are used. Comparison of the three combining strategies ECOC,
LS-ECOC and Centroid-ECOC suggests that the combination strategy does not play
a major role in improving performance. This result lends support to the finding
of others that the error-correcting capability of a designed code may not be a
significant aspect of the ECOC method, at least with respect to the combining
strategies considered here.

In order to apply ECOC to situations where super-class probabilities are not
suitable measures by themselves, we conclude that it may be useful to look at
variants of ECOC. Least Squares represents an alternative combining strategy
for ECOC that can give comparable classification results to the original
distance-based strategy. If individual class probabilities are required, LS-ECOC
provides a method of recovering them.

Abstract. One of the main factors affecting the effectiveness of ECOC
methods for classification is the dependence among the errors of the
computed codeword bits. We present an extensive experimental work
for evaluating the dependence among output errors of the decomposition
unit of ECOC learning machines. In particular, we compare the
dependence between ECOC Multi Layer Perceptrons (ECOC monolithic),
made up by a single MLP, and ECOC ensembles made up by a set of
independent and parallel dichotomizers (ECOC PND), using measures
based on mutual information. In this way we can analyze the relations
between performance, design and dependence among output errors in
ECOC learning machines. Results quantitatively show that the dependence
among computed codeword bits is significantly smaller for ECOC
PND, pointing out that ensembles of independent dichotomizers are better
suited for implementing ECOC classification methods.



1 Introduction 

Error Correcting Output Coding (ECOC) is a two-stage Output Coding (OC)
decomposition method that has been successfully applied to several
classification problems. In its first stage it decomposes a multiclass
classification problem into a set of two-class subproblems, and in a second stage
it recomposes the original problem, combining them to achieve the class label.

ECOC methods present several open problems, such as the tradeoff between
error recovering capabilities and learnability of the dichotomies induced by the
decomposition scheme. A connected problem is the analysis of the relation
between codeword length and performance, while the selection of optimal
dichotomic learning machines and the design of optimal codes for a given
multiclass problem are other open questions subject to active research.

Another problem tackled by different works is the relation between the
performance of ECOC and the dependence among output errors. In the framework of
coding theory, Peterson has shown that the error recovering capabilities of
ECOC codes hold if there is a low dependence among codeword bits. In particular,
in a previous work we qualitatively identified the dependence among output
errors as one of the factors affecting the effectiveness of ECOC decomposition
methods. In that work we outlined that we would expect a higher dependence
among codeword bits in monolithic Error Correcting Output Coding (ECOC
monolithic for short) compared with ECOC Parallel Non-linear Dichotomizers
(PND) (ECOC PND for short) learning machines, considering that ECOC
monolithic outputs share the same hidden layer of a single MLP, while PND
dichotomizers, implemented by a separate MLP for each codeword bit, have their own
layer of hidden units, specialized for a specific dichotomic task.

The aim of this work is to quantitatively test whether the dependence among output
errors of ECOC monolithic and ECOC PND is significantly different. In
particular, we perform an extensive experimentation for comparing the
dependence among output errors of the decomposition unit of ECOC monolithic and
ECOC PND using measures based on mutual information, in order to evaluate
whether a low dependence among output errors is related to better classification
performance.

The paper is structured as follows. In the next section we summarize the
main characteristics of the measures based on mutual information we propose
for evaluating the dependence among output errors in learning machines. Sect. 3
presents the experimental setup, the results and the discussion about the
quantitative comparison of dependence among output errors between ECOC
monolithic and ECOC PND learning machines. The conclusions summarize the main
results and the forthcoming developments of this work.



2 Mutual Information Based Measures of Dependence 
among Output Errors 

In this section we present a brief overview of the mutual information based mea- 
sures for evaluating the dependence among output errors in learning machines. 
A more detailed discussion can be found in jOj. 

The main idea behind the evaluation of dependence among output errors of 
learning machines through mutual information based measures consists in inter- 
preting the dependence among the outputs as the common information shared 
among them. Mutual information takes into account the marginal and joint prob- 
ability distributions of the output errors, measuring in a sense the information 
shared among them. Using standard statistical measures such as the covariance 
or the coefficient of correlation we estimate only the linear relation between out- 
put errors. Conversely, a suitable measure of dependence must evaluate directly 
the probability distribution of the output errors in order to properly evaluate 
the stochastic independence between random variables. Mutual information, be- 
ing a special case of the Kullback-Leibler divergence between two distributions, 
measures the matching between the joint density distribution and the product 
of the marginal density distributions of the output errors. If we have a complete
matching, the mutual information is 0 and the output errors are independent;
otherwise, the higher the value of the mutual information between output errors,
the higher the dependence between them.

The first measure based on mutual information we define is the mutual
information error $I_E$:

$$I_E(e_1, \ldots, e_l) = \sum_{j_1=1}^{b} \cdots \sum_{j_l=1}^{b} p(e_{1j_1}, \ldots, e_{lj_l}) \log \frac{p(e_{1j_1}, \ldots, e_{lj_l})}{\prod_{i=1}^{l} p(e_{ij_i})} \qquad (1)$$

where $p(e_{1j_1}, \ldots, e_{lj_l})$ is the discrete joint probability distribution
among all the $l$ output errors and $p(e_{ij_i})$ is the discrete probability
distribution of the $i$th output error, with $i \in \{1, \ldots, l\}$ and with the
$j_i \in \{1, \ldots, b\}$ corresponding to the discretization of the output errors
in b intervals. The mutual information error (Eq. 1) expresses the dependence
among all output errors of a learning machine. If it is equal to 0 then the
output errors are statistically independent. It also expresses how similar the
probability distributions of the output errors are.
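
A minimal sketch of how the mutual information error of Eq. 1 could be estimated from samples is given below. It assumes that each output error lies in [0, 1], that the b intervals have equal width, and that probabilities are estimated by relative frequencies; the function names are hypothetical and are not those of the NEURObjects library.

```python
import math
from collections import Counter


def discretize(e, b, lo=0.0, hi=1.0):
    """Map an error value in [lo, hi] to one of b equal-width interval indices."""
    j = int((e - lo) / (hi - lo) * b)
    return min(max(j, 0), b - 1)


def mutual_information_error(errors, b):
    """Estimate I_E from a list of samples, each a tuple of l output errors."""
    n = len(errors)
    cells = [tuple(discretize(e, b) for e in sample) for sample in errors]
    joint = Counter(cells)
    l = len(cells[0])
    marginals = [Counter(cell[i] for cell in cells) for i in range(l)]
    total = 0.0
    for cell, count in joint.items():
        p_joint = count / n
        p_product = 1.0
        for i, j in enumerate(cell):
            p_product *= marginals[i][j] / n
        total += p_joint * math.log(p_joint / p_product)
    return total


# Two toy data sets with l = 2 outputs and b = 2 intervals.
independent = [(0.1, 0.1), (0.1, 0.9), (0.9, 0.1), (0.9, 0.9)]
identical = [(0.1, 0.1), (0.9, 0.9)]

i_e_independent = mutual_information_error(independent, 2)
i_e_identical = mutual_information_error(identical, 2)
```

The first toy set factorises exactly, so its estimate is 0; in the second the two errors always fall in the same interval, and the estimate equals log 2 (natural logarithm).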

Considering the outputs of a learning machine correct if their errors are below
a certain threshold, i.e. if $\forall i$, $e_i < \delta$, $\delta > 0$, we define
the mutual information specific error $I_{SE}$:



S !>(««. los 

where 

= I [jii ■ ■ • ji]\^{jv,jw)\{jv 1) {jw I 

with $v \neq w \in \{1 \ldots l\}$. This measure takes into account the output
errors only when two or more errors spring from the output, disregarding all cases
with no errors or with only one error. For evaluating the dependence among
specific pairs of output errors, we introduce the pairwise mutual information
error matrix R, composed of the elements $I_E(e_i, e_j) = [R_{ij}]$, and the
pairwise mutual information specific error matrix S, composed of the elements
$I_{SE}(e_i, e_j) = [S_{ij}]$. We then also define two other global indices: the
pairwise mutual information error matrix index $\Phi_R$:

$$\Phi_R = \sum_{i=1}^{l} \sum_{j=1}^{l} I_E(e_i, e_j) \qquad (3)$$

and the pairwise mutual information specific error matrix index $\Phi_S$:

$$\Phi_S = \sum_{i=1}^{l} \sum_{j=1}^{l} I_{SE}(e_i, e_j) \qquad (4)$$

These indices measure the sum of the mutual information error and the
mutual information specific error between all the output pairs of the learning
machines, and in this sense can be regarded as global measures of dependence
between output errors. Note that these indices (Eq. 3 and 4) are not equivalent
to the corresponding Eq. 1 and 2 of the mutual information among all output
errors: Eq. 3 and 4 consider only the mutual information between pairs of output
errors, while Eq. 1 and 2 consider the overall mutual information among all
output errors.

Table 1. Main features of the data sets.

Data set    Number of    Number of   Number of          Number of
            attributes   classes     training samples   testing samples
d5          3            5           30000              30000
glass       9            6           214                10-fold cross-val
letter      16           26          16000              4000
optdigits   64           10          3823               1797

These mutual information related quantities can be used to compare the 
dependence of the output errors among different learning machines on the same 
learning problem, using, of course, the same data sets. 
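
To make the pairwise quantities concrete, the sketch below builds a pairwise mutual information matrix from already-discretized output errors and sums it into a single index. It is an illustration under assumptions: probabilities are relative frequencies, each unordered pair of outputs is counted once and the diagonal is skipped (the exact summation limits of Eq. 3 and 4 may differ), and the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pairwise_mi(u, v):
    """Mutual information between two sequences of discretized output errors."""
    n = len(u)
    pu, pv, puv = Counter(u), Counter(v), Counter(zip(u, v))
    return sum(c / n * math.log(c * n / (pu[a] * pv[b]))
               for (a, b), c in puv.items())


def pairwise_matrix_and_index(columns):
    """Return the symmetric pairwise matrix and the sum over unordered pairs."""
    l = len(columns)
    R = [[0.0] * l for _ in range(l)]
    for i, j in combinations(range(l), 2):
        R[i][j] = R[j][i] = pairwise_mi(columns[i], columns[j])
    index = sum(R[i][j] for i, j in combinations(range(l), 2))
    return R, index


# Toy example: interval indices of three output errors over four test patterns.
columns = [
    [0, 0, 1, 1],   # output 1
    [0, 0, 1, 1],   # output 2, always in the same interval as output 1
    [0, 1, 0, 1],   # output 3, independent of the other two
]
R, phi = pairwise_matrix_and_index(columns)
```

In this toy case only the pair (1, 2) contributes, so the index equals log 2. Comparing such an index for two learning machines on the same test set is the kind of comparison the text describes.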

3 Experimental Results 

In this section we present a quantitative comparison of the dependence among
output errors of the decomposition unit of ECOC monolithic and ECOC
PND learning machines, and we analyze the relations between performance,
design and dependence among output errors. For this purpose we experimentally
compare the mutual information error $I_E$, the mutual information specific error
$I_{SE}$ and the pairwise indices $\Phi_R$ and $\Phi_S$ (Sect. 2) of the ECOC
monolithic and PND learning machines using different data sets.



3.1 Experimental Setup 

We have used four different data sets: the first one, d5 (1), is generated by
NEURObjects, a set of C++ library classes for neural networks development, and
the other three, glass, letter and optdigits, are from the UCI machine learning
repository of Irvine. The synthetic data set d5 is made up of five
three-dimensional classes, each composed of two normally distributed disjoint
clusters of data. The main characteristics of the data sets are shown in Tab. 1.
In order to perform training and testing of the considered learning machines, we
have applied multiple runs of different random initializations of weights using a
single pair of training and testing data sets, and k-fold cross validation methods.
The results are summarized in Tab. 2: errors on the test set are expressed as
percent rates, and for each data set the minimum (min), average (mean), and
standard deviation (stdev) of the error are given. We have used the software
library NEURObjects both for training the learning machines and for evaluating
the dependence among the output errors.

We have compared the dependence among output errors of ECOC monolithic
and ECOC PND learning machines varying the structure (number of hidden
units), the number of discretization intervals of the output errors, and the values
of $\delta$ (Sect. 2) that define the notion of "correctness" of the outputs.

(1) d5 is available on line at ftp://ftp.disi.unige.it/person/ValentiniG/Data.

3.2 Results and Discussion 

In this section we present the results of the comparison of $I_E$ and $I_{SE}$ among
all outputs, of the $\Phi_R$ and $\Phi_S$ pairwise indices, and the comparison of the
R and S matrices.

In Fig. 1 we compare $I_E$ and $I_{SE}$ among all output errors of the monolithic
and ECOC PND learning machines on the data sets d5 and glass. On the axes
are represented the computed $I_E$ (Fig. 1a and b) and $I_{SE}$ (Fig. 1c and d) values.
Each point corresponds to a different triplet of number of hidden units, number of
intervals and value of $\delta$. We point out that all points are above the dotted
line, showing that both $I_E$ (Fig. 1a and b) and $I_{SE}$ (Fig. 1c and d) are greater
for ECOC monolithic than for ECOC PND, whatever the structure, the number
of intervals and the $\delta$ values used. Fig. 2 shows that on all the data sets
almost all the points are above the dotted line, i.e. the values of $\Phi_R$ are
almost always greater for ECOC monolithic than for ECOC PND. Similar results
also hold for the $\Phi_S$ index. The examination of the pairwise mutual information
error matrices can provide us with information about the dependence of specific
pairs of output errors. The S and R matrices are represented as triangular
matrices, without the diagonal, because they are symmetric and the elements on the
diagonal are the entropies of the output errors.

Comparing the mutual information matrices of ECOC monolithic and
PND learning machines, we find that almost all the pairwise mutual information
errors are higher in ECOC monolithic: on the d5 data set no element of the R
matrix is higher for PND and only 1 of 21 is higher considering the S matrix;
on optdigits only 3 of 91, for both the R and S matrices, are higher; and no element
of the 435 composing the triangular matrices R and S is higher for PND on the
letter data set.

Table 2. Performance of ECOC monolithic and ECOC PND ensemble on four data
sets (percent error rates).

            ECOC monolithic           ECOC PND ensemble
Data set    min     mean    stdev     min     mean    stdev
d5          13.27   18.31   6.44      11.91   12.34   0.74
glass       33.18   36.17   4.54      30.37   32.05   1.77
letter       4.95    6.55   1.91       3.05    3.24   0.24
optdigits    2.61    3.08   0.47       1.89    1.95   0.10



[Figure 1: panels (a)-(d), each plotting ECOC monolithic (vertical axis) against
ECOC PND ensemble (horizontal axis) for the d5 and glass data sets.]

Fig. 1. Compared mutual information error $I_E$ and mutual information specific error
$I_{SE}$ among all outputs between ECOC monolithic and PND learning machines on d5
(a)(c) and glass (b)(d) data sets.








[Figure 2: panels (a)-(d); axis label: ECOC monolithic.]

Fig. 2. Compared mutual information error matrix indices $\Phi_R$ between ECOC
monolithic and PND learning machines on d5 (a), glass (b), optdigits (c) and
letter (d) data sets.

Fig. 3 shows the relations between error rates and the mutual information based
measures $I_E$ and $I_{SE}$ considering the d5 data set. Both the $I_E$ and $I_{SE}$
curves of the ECOC PND ensemble lie below the corresponding curves of ECOC
monolithic learning machines: these figures confirm that the dependence among output
errors is smaller for ECOC PND. It is worth noting that, as expected, $I_E$ and
$I_{SE}$ grow with error rates, but their values are mostly related to a specific
learning machine architecture.

We have seen that all the results for the mutual information error Ie and
the mutual information specific error Ise among all the outputs on the data
sets d5 and glass show greater values for ECOC monolithic than for ECOC PND
(Fig. 1). These results are confirmed by the evaluation of the mutual
information error matrix indices Φ_R and Φ_S (Fig. 2), which also covers the
optdigits and letter data sets. The analysis of the pairwise mutual information
matrices R and S likewise shows that nearly all the Ie and Ise values between
each pair of output errors are greater for ECOC monolithic learning machines.
Moreover, applying the mutual information error t-test [9] to evaluate the
significance of the differences between the Ie and Ise values of the two ECOC
learning machines, we have verified that almost all the comparisons show a
significant difference at a confidence level of 95%.

Consequently the experimental results on the selected data sets confirm that 
ECOC Parallel Non linear Dichotomizers show a lower dependence among the 
output errors of their decomposition unit compared with the output errors of 
the corresponding ECOC monolithic multi layer perceptron. 
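
The idea of dependence between the errors of two output units can be illustrated with a plug-in estimate of mutual information between binary error indicators. The following Python sketch is only an illustration under stated assumptions: it is not the authors' exact definition of Ie or Ise (those are given earlier in the paper), and the function name `pairwise_error_mi` and the toy error vectors are hypothetical.

```python
import math
from collections import Counter

def pairwise_error_mi(errors_a, errors_b):
    """Plug-in mutual information (in bits) between two error indicator sequences.

    errors_a[t] is 1 if output unit A is wrong on example t, else 0 (same for B).
    """
    n = len(errors_a)
    joint = Counter(zip(errors_a, errors_b))
    marg_a = Counter(errors_a)
    marg_b = Counter(errors_b)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((marg_a[a] / n) * (marg_b[b] / n)))
    return mi

# Hypothetical toy data: fully dependent errors versus independent errors.
dependent = pairwise_error_mi([0, 1, 0, 1], [0, 1, 0, 1])
independent = pairwise_error_mi([0, 0, 1, 1], [0, 1, 0, 1])
print(dependent, independent)
```

A value near zero indicates that knowing the error of one output says little about the error of the other, which is the situation in which error-correcting codes are expected to work best.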



4 Conclusions 

In this paper, we have compared the dependence among output errors between 
ECOC monolithic MLP and ECOC PND learning machines using measures 
based on mutual information. 

The measurements of the mutual information error Ie, the mutual informa-
tion specific error Ise and the mutual information error matrix indices Φ_R and
Φ_S show that ECOC PND have a lower dependence among the output errors of
their decomposition unit compared with the output errors of the corresponding 
ECOC monolithic MLP. Hence ECOC PND ensembles appear more suited to 
exploit the error recovering capabilities of ECOC methods, whose effectiveness 
depends on the independence among codeword bits errors IHEl- 

The observed difference in the dependence among output errors is related to 
the different design of the two learning machines and in particular to the design 
of the decomposition unit. Our experimentation suggests that a low dependence 
can be achieved implementing the decomposition unit through an ensemble of 
parallel and independent dichotomizers, such as the dichotomic MLPs proposed 
in our experimentation, or other suitable dichotomizers such as decision trees or 
support vector machines. 

An ongoing development of this work consists in quantitatively studying 
how boosting methods can increase the diversity among the dichotomizers and 
the independence among output errors in ECOC learning machines, using the
proposed measures based on mutual information, and extending them to evaluate 
the diversity between the base learners. 



Acknowledgments. We would like to thank the anonymous reviewers for their
comments and suggestions. This work has been partially funded by Progetto
finalizzato CNR-MADESS II, INFM and University of Genova.

References 

1. E.L. Allwein, R.E. Schapire, and Y. Singer. Reducing multiclass to binary: a 
unifying approach for margin classifiers. In Proc. ICML’2000, The Seventeenth 
International Conference on Machine Learning, 2000. 

2. A. Berger. Error correcting output coding for text classification. In IJCAI'99:
Workshop on machine learning for information filtering, 1999.

3. K. Crammer and Y. Singer. On the learnability and design of output codes for
multiclass problems. In Proceedings of the Thirteenth Annual Conference on Com- 
putational Learning Theory, pages 35-46, 2000. 

4. T.G. Dietterich and G. Bakiri. Solving multiclass learning problems via error- 
correcting output codes. Journal of Artificial Intelligence Research, (2):263-286, 
1995. 

5. R. Ghani. Using error correcting output codes for text classification. In ICML 2000:
Proceedings of the 17th International Conference on Machine Learning, pages 303-
310, San Francisco, US, 2000. Morgan Kaufmann Publishers.

6. V. Guruswami and A. Sahai. Multiclass learning, boosting, and error-correcting 
codes. In Proc. of the Twelfth Annual Conference on Computational Learning 
Theory, pages 145-155. ACM Press, 1999. 

7. E. Kong and T.G. Dietterich. Error-correcting output coding corrects bias and
variance. In The XII International Conference on Machine Learning, pages 313-
321, San Francisco, CA, 1995. Morgan Kaufmann.

8. F. Masulli and G. Valentini. Effectiveness of error correcting output codes in 
multiclass learning problems. In Lecture Notes in Computer Science, volume 1857, 
pages 107-116. Springer-Verlag, Berlin, Heidelberg, 2000.

9. F. Masulli and G. Valentini. Mutual information methods for evaluating depen-
dence among outputs in learning machines. Technical Report TR-01-02, DISI -
Dipartimento di Informatica e Scienze dell'Informazione - Università di Genova,
2001. ftp://ftp.disi.unige.it/person/ValentiniG/papers/TR-01-02.ps.gz.

10. E. Mayoraz and M. Moreira. On the decomposition of polychotomies into di- 
chotomies. In The XIV International Conference on Machine Learning, pages 
219-226, Nashville, TN, July 1997. 

11. C.J. Merz and P.M. Murphy. UCI repository of machine learning databases, 1998.
www.ics.uci.edu/mlearn/MLRepository.html. 

12. W.W. Peterson and E.J. Weldon Jr. Error correcting codes. MIT Press, Cambridge,
MA, 1972.

13. G. Valentini and F. Masulli. NEURObjects, a set of library classes for neural 
networks development. In Proceedings of IIA'99 and SOCO'99, pages 184-190,
Millet, Canada, 1999. ICSC Academic Press. 




Information Analysis of Multiple Classifier Fusion



Jiří Grim¹, Josef Kittler², Pavel Pudil¹, and Petr Somol¹

¹ Institute of Information Theory and Automation,
P.O. Box 18, CZ-18208 Prague 8, Czech Republic,
{grim, pudil, somol}@utia.cas.cz

² School of Electronic Engineering, Information Technology and Mathematics,
University of Surrey, Guildford GU2 5XH, United Kingdom



Abstract. We consider a general scheme of parallel classifier combina- 
tions in the framework of statistical pattern recognition. Each statistical 
classifier defines a set of output variables in terms of a posteriori prob- 
abilities, i.e. it is used as a feature extractor. Unlike usual combining 
schemes the output vectors of classifiers are combined in parallel. The 
statistical Shannon information is used as a criterion to compare different 
combining schemes from the point of view of the theoretically available 
decision information. By means of relatively simple arguments we derive 
a theoretical hierarchy between different schemes of classifier fusion in 
terms of information inequalities. 



1 Introduction 



A natural way to solve practical problems of pattern recognition is to try different 
classification methods, parameters and feature subsets with the aim to achieve 
the best performance. However, as different classifiers frequently make different 
recognition errors, it is often useful to combine multiple classifiers in order to 
improve the final recognition accuracy. In the last few years various combination
methods were proposed by different authors (cf. [?] for extensive references).

The most widely used approach typically combines the classifier outputs
directly by means of simple combining rules or functions. It relates to techniques
like majority vote, threshold voting, averaged Bayes classifier, different linear
combinations of a posteriori probabilities, maximum and minimum rules, product
rule (cf. e.g. [?]) and also more complex combining tools like fuzzy logic
or Dempster-Shafer theory of evidence (cf. e.g. [?]).

Another approach makes use of classifiers as feature extractors. The extracted
features (e.g. a posteriori probabilities) are used simultaneously in parallel to
define a new decision problem (cf. [?]). Instead of a simple combining
function the new features are evaluated by a classifier again (e.g. by a neural
network) to realize the compound classification. This approach is capable of very
general solutions but the potential advantages are achieved at the expense of the
lost simplicity of the combining rules.

In the present paper we consider a general scheme corresponding to the last 
type of parallel classifier combinations in the framework of statistical pattern 
recognition. In particular we assume that each statistical classifier defines a set 
of output variables in terms of a posteriori probabilities which are simply used 
in parallel as features. We use the term parallel classifier fusion to emphasize 
the difference with the combining functions. 

The standard criterion to measure the quality of classifier combination tech- 
niques is the recognition accuracy. In this paper we use the statistical Shannon 
information to compare different combining schemes as it is more sensitive than 
the classification error and easily applicable. By means of relatively simple facts 
we derive a theoretical hierarchy between basic schemes of classifier fusion in 
terms of information inequalities. The results have general validity but their 
meaning is rather theoretical since the compared decision information is only 
theoretically available in the new compound feature space. 

In Section 2 we describe the framework of statistical pattern recognition and
introduce the basic concept of information preserving transform. In Section 3
we discuss the information properties of imprecise classifiers and show how the
information loss of practical solutions can be reduced by the parallel fusion
(Sec. 4). In Sec. 5 we compare the parallel classifier fusion with a method based
on general combining rules. The obtained hierarchy of different methods of
multiple classifier fusion is discussed in Sec. 6 and finally summarized in the
Conclusion (Sec. 7).

2 Information Preserving Transform 

Considering the framework of statistical pattern recognition we assume in this
paper that some $N$-dimensional binary observations $x$ have to be classified into
one of mutually exclusive classes $\omega \in \Omega$:

$$x = (x_1, \ldots, x_N) \in \mathcal{X}, \quad \mathcal{X} = \{0,1\}^N, \quad \Omega = \{\omega_1, \ldots, \omega_K\}.$$

The observations $x \in \mathcal{X}$ are supposed to occur randomly according to some a
priori probabilities $p(\omega)$ and class-conditional probability distributions

$$P(x|\omega)\,p(\omega), \quad x \in \mathcal{X}, \ \omega \in \Omega. \qquad (1)$$

Given the probabilistic description of classes, we can compute for any input
vector $x$ the a posteriori probabilities

$$p(\omega|x) = \frac{P(x|\omega)\,p(\omega)}{P(x)}, \quad P(x) = \sum_{\omega \in \Omega} P(x|\omega)\,p(\omega), \quad x \in \mathcal{X}, \qquad (2)$$

where $P(x)$ is the unconditional joint probability distribution of $x$. The a
posteriori probabilities $p(\omega|x)$ contain all statistical information about the
set of classes $\Omega$ given $x \in \mathcal{X}$ and can be easily used to classify uniquely the
input vector $x$, if necessary.

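For the sake of information analysis let us consider the following vector
transform $T$ of the original decision problem defined on the input space $\mathcal{X}$:

$$T: \mathcal{X} \to \mathcal{Y}, \quad \mathcal{Y} \subset R^K, \quad y = T(x) = (T_1(x), \ldots, T_K(x)) \in \mathcal{Y}, \qquad (3)$$

$$y_k = T_k(x) = \varphi(p(\omega_k|x)), \quad x \in \mathcal{X}, \ k = 1, \ldots, K, \qquad (4)$$

where $\varphi$ is any bijective function. Let us remark that there are strong arguments
to specify the function $\varphi$ as logarithm in connection with neural networks [2,4].
The transform (3), (4) naturally induces a partition of the space $\mathcal{X}$

$$\mathcal{S} = \{S_y, \ y \in \mathcal{Y}\}, \quad S_y = \{x \in \mathcal{X} : T(x) = y\}, \quad \bigcup_{y \in \mathcal{Y}} S_y = \mathcal{X} \qquad (5)$$

and transforms the original distributions $P_{\mathcal{X}}, P_{\mathcal{X}|\omega}$ on the input space $\mathcal{X}$ to the
distributions $Q_{\mathcal{Y}}, Q_{\mathcal{Y}|\omega}$ on $\mathcal{Y}$:

$$Q_{\mathcal{Y}}(y) = \sum_{x \in S_y} P_{\mathcal{X}}(x) = P_{\mathcal{X}}(S_y), \quad y \in \mathcal{Y}, \qquad (6)$$

$$Q_{\mathcal{Y}|\omega}(y|\omega) = \sum_{x \in S_y} P_{\mathcal{X}|\omega}(x|\omega) = P_{\mathcal{X}|\omega}(S_y|\omega), \quad \omega \in \Omega. \qquad (7)$$

Throughout the paper we use, whenever tolerable, the simpler notation
$P(x) = P_{\mathcal{X}}(x)$, $P(x|\omega) = P_{\mathcal{X}|\omega}(x|\omega)$, $Q(y) = Q_{\mathcal{Y}}(y)$, $Q(y|\omega) = Q_{\mathcal{Y}|\omega}(y|\omega)$.

In analogy with (2) we can write

$$p(\omega|y) = \frac{Q(y|\omega)\,p(\omega)}{Q(y)}. \qquad (8)$$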

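It is well known that, from the point of view of classification error, the a
posteriori probabilities represent optimal features (cf. [?]). Moreover, it has been
shown that the transform defined by Eqs. (3), (4) preserves the statistical decision
information and minimizes the entropy of the output space (cf. [2,17]). In
particular, if we introduce the following notation for the unconditional and
conditional Shannon entropies

$$H(p_\Omega) = \sum_{\omega \in \Omega} -p(\omega) \log p(\omega), \quad H(p_\Omega|x) = \sum_{\omega \in \Omega} -p(\omega|x) \log p(\omega|x), \qquad (9)$$

$$H(p_\Omega|P_{\mathcal{X}}) = \sum_{x \in \mathcal{X}} P(x)\,H(p_\Omega|x), \quad H(p_\Omega|Q_{\mathcal{Y}}) = \sum_{y \in \mathcal{Y}} Q(y)\,H(p_\Omega|y), \qquad (10)$$

then we can write

$$I(P_{\mathcal{X}}, p_\Omega) = H(p_\Omega) - H(p_\Omega|P_{\mathcal{X}}) = H(p_\Omega) - H(p_\Omega|Q_{\mathcal{Y}}) = I(Q_{\mathcal{Y}}, p_\Omega). \qquad (11)$$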
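
Let us recall that

$$S_y = \{x \in \mathcal{X} : T(x) = y\} = \{x \in \mathcal{X} : p(\omega_k|x) = \varphi^{-1}(y_k), \ k = 1, \ldots, K\} \qquad (12)$$

and therefore the distributions $p_{\Omega|x}$ are identical for all $x \in S_y$. Thus, in view
of the definition (8), we can write for any $\omega \in \Omega$, $y \in \mathcal{Y}$ and for all $x \in S_y$:

$$p(\omega|y) = \frac{Q(y|\omega)\,p(\omega)}{Q(y)} = \frac{P(S_y|\omega)\,p(\omega)}{P(S_y)} = \sum_{\tilde{x} \in S_y} \frac{P(\tilde{x})}{P(S_y)}\, p(\omega|\tilde{x}) = p(\omega|x). \qquad (13)$$

Consequently, we obtain the equation

$$H(p_\Omega|Q_{\mathcal{Y}}) = \sum_{y \in \mathcal{Y}} P(S_y)\,H(p_\Omega|y) = \sum_{y \in \mathcal{Y}} \sum_{x \in S_y} P(x)\,H(p_\Omega|x) = H(p_\Omega|P_{\mathcal{X}}) \qquad (14)$$

which implies Eq. (11). In words, the transform $T$ preserves the statistical
Shannon information $I(P_{\mathcal{X}}, p_\Omega)$ about $p_\Omega$ contained in $P_{\mathcal{X}}$ (cf. [2,17]).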
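
The invariance stated by Eq. (11) can be checked numerically on a small example. The following Python sketch assumes a hypothetical toy problem (2-bit observations, two equiprobable classes, hand-picked class-conditional tables) and takes the bijection in Eq. (4) to be the identity; none of these choices come from the paper. It maps each x to its vector of a posteriori probabilities and compares the mutual information before and after the transform.

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """Mutual information (in nats) of a joint table {(z, w): probability}."""
    pz, pw = defaultdict(float), defaultdict(float)
    for (z, w), p in joint.items():
        pz[z] += p
        pw[w] += p
    return sum(p * math.log(p / (pz[z] * pw[w]))
               for (z, w), p in joint.items() if p > 0)

# Hypothetical toy problem: the second bit carries no class information.
prior = {"w1": 0.5, "w2": 0.5}
cond = {  # P(x | w)
    "w1": {(0, 0): 0.4, (0, 1): 0.4, (1, 0): 0.1, (1, 1): 0.1},
    "w2": {(0, 0): 0.1, (0, 1): 0.1, (1, 0): 0.4, (1, 1): 0.4},
}
joint_x = {(x, w): cond[w][x] * prior[w] for w in prior for x in cond[w]}

def posterior_features(x):
    """y = T(x): vector of a posteriori probabilities, rounded so equal values collide."""
    px = sum(joint_x[(x, w)] for w in prior)
    return tuple(round(joint_x[(x, w)] / px, 12) for w in sorted(prior))

# Transformed joint distribution Q(y, w) obtained by summing P over the sets S_y.
joint_y = defaultdict(float)
for (x, w), p in joint_x.items():
    joint_y[(posterior_features(x), w)] += p

info_x = mutual_information(joint_x)
info_y = mutual_information(dict(joint_y))
n_cells = len({y for y, _ in joint_y})
print(n_cells, info_x, info_y)
```

In this example the four input vectors collapse to two feature vectors, yet the two information values agree up to rounding, as Eq. (11) predicts.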



3 Information Loss Caused by Imprecise Classifiers 

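Let $P^{(i)}_{\mathcal{X}|\omega}$ be some estimates of the unknown conditional probability distributions
$P_{\mathcal{X}|\omega}$ obtained e.g. in $I$ different computational experiments:

$$P^{(i)}_{\mathcal{X}|\omega}, \quad \omega \in \Omega, \ i \in \mathcal{I}, \quad \mathcal{I} = \{1, \ldots, I\}. \qquad (15)$$

We denote $p^{(i)}(\omega|x)$ the a posteriori probabilities which can be computed from
the estimated distributions $P^{(i)}_{\mathcal{X}|\omega}$:

$$p^{(i)}(\omega|x) = \frac{P^{(i)}(x|\omega)\,p(\omega)}{P^{(i)}(x)}, \quad P^{(i)}(x) = \sum_{\omega \in \Omega} P^{(i)}(x|\omega)\,p(\omega), \quad x \in \mathcal{X}. \qquad (16)$$

Here and in the following Sections we assume the a priori probabilities $p(\omega)$ to
be known and fixed. In analogy with (3), (4) we define

$$T^{(i)}: \mathcal{X} \to \mathcal{Y}^{(i)}, \quad \mathcal{Y}^{(i)} \subset R^K, \quad y^{(i)} = T^{(i)}(x) = (T^{(i)}_1(x), \ldots, T^{(i)}_K(x)) \in \mathcal{Y}^{(i)},$$

$$y^{(i)}_k = T^{(i)}_k(x) = \varphi(p^{(i)}(\omega_k|x)), \quad x \in \mathcal{X}, \ k = 1, \ldots, K, \ (i \in \mathcal{I}). \qquad (17)$$

Again, the transform $T^{(i)}$ induces a partition of the input space $\mathcal{X}$:

$$\mathcal{S}^{(i)} = \{S_{y^{(i)}}, \ y^{(i)} \in \mathcal{Y}^{(i)}\}, \quad S_{y^{(i)}} = \{x \in \mathcal{X} : T^{(i)}(x) = y^{(i)}\} \qquad (18)$$

and transforms the original true probability distributions $P_{\mathcal{X}}, P_{\mathcal{X}|\omega}$ on $\mathcal{X}$ to the
corresponding distributions $Q^{(i)}_{\mathcal{Y}}, Q^{(i)}_{\mathcal{Y}|\omega}$ on $\mathcal{Y}^{(i)}$ (cf. (6), (7)):

$$Q^{(i)}_{\mathcal{Y}}(y^{(i)}) = P_{\mathcal{X}}(S_{y^{(i)}}), \quad Q^{(i)}_{\mathcal{Y}|\omega}(y^{(i)}|\omega) = P_{\mathcal{X}|\omega}(S_{y^{(i)}}|\omega), \quad y^{(i)} \in \mathcal{Y}^{(i)}. \qquad (19)$$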
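In analogy with (13) we can write for any $\omega \in \Omega$ and $y^{(i)} \in \mathcal{Y}^{(i)}$:

$$p(\omega|y^{(i)}) = p(\omega|S_{y^{(i)}}) = \frac{P(S_{y^{(i)}}|\omega)\,p(\omega)}{P(S_{y^{(i)}})} = \sum_{x \in S_{y^{(i)}}} \frac{P(x)}{P(S_{y^{(i)}})}\, p(\omega|x). \qquad (20)$$

However, unlike Eq. (13), the probabilities $p(\omega|x)$ need not be identical for
all $x \in S_{y^{(i)}}$ because the partition $\mathcal{S}^{(i)}$ derives from the estimated
distributions $P^{(i)}_{\mathcal{X}|\omega}$ which may differ from the unknown true distributions $P_{\mathcal{X}|\omega}$.
Thus, for some $S_{y^{(i)}} \in \mathcal{S}^{(i)}$, there may be different vectors $x, x' \in S_{y^{(i)}}$ such that
$p(\omega|x) \neq p(\omega|x')$ for some $\omega \in \Omega$. Consequently, we can write the following
general Jensen's inequality for the concave function $-\xi \log \xi$:

$$-p(\omega|S_{y^{(i)}}) \log p(\omega|S_{y^{(i)}}) \geq \sum_{x \in S_{y^{(i)}}} \frac{P(x)}{P(S_{y^{(i)}})} \left[-p(\omega|x) \log p(\omega|x)\right]. \qquad (21)$$

Multiplying the inequality (21) by $P(S_{y^{(i)}})$ and summing through $\omega \in \Omega$ and
$y^{(i)} \in \mathcal{Y}^{(i)}$ we obtain the following inequality for conditional entropies

$$H(p_\Omega|Q^{(i)}_{\mathcal{Y}}) = \sum_{y^{(i)} \in \mathcal{Y}^{(i)}} Q^{(i)}(y^{(i)})\,H(p_\Omega|y^{(i)}) \geq \sum_{x \in \mathcal{X}} P(x)\,H(p_\Omega|x) = H(p_\Omega|P_{\mathcal{X}}). \qquad (22)$$

In view of Eq. $H(p_\Omega|P_{\mathcal{X}}) = H(p_\Omega|Q_{\mathcal{Y}})$ (cf. (14)) it follows that

$$I(P_{\mathcal{X}}, p_\Omega) = I(Q_{\mathcal{Y}}, p_\Omega) \geq I(Q^{(i)}_{\mathcal{Y}}, p_\Omega), \quad (i \in \mathcal{I}). \qquad (23)$$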

Thus, if the true probabilistic description $P_{\mathcal{X}|\omega}$, $\omega \in \Omega$ is unknown and we are
given only some estimated distribution $P^{(i)}_{\mathcal{X}|\omega}$, then we may expect the transform
$T^{(i)}$ to be accompanied by some information loss. In other words, as is well
known, the extracted features $y^{(i)}(x)$ usually contain only a part of the original
decision information.
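
The information loss expressed by inequality (23) can be made concrete on a toy problem. The Python sketch below is an illustration only: the true class-conditional tables `TRUE`, the deliberately crude estimate `EST` (which ignores the second bit of x), and all function names are hypothetical, and the bijection of Eq. (17) is taken to be the identity. The partition is induced by the estimated posteriors, while the information is evaluated under the true distributions.

```python
import math
from collections import defaultdict

CLASSES = ("w1", "w2")
PRIOR = {"w1": 0.5, "w2": 0.5}
XS = [(0, 0), (0, 1), (1, 0), (1, 1)]
# Hypothetical true class-conditional distributions P(x | w).
TRUE = {"w1": dict(zip(XS, (0.4, 0.3, 0.2, 0.1))),
        "w2": dict(zip(XS, (0.1, 0.2, 0.3, 0.4)))}
# Hypothetical imprecise estimate that only "sees" the first bit of x.
EST = {"w1": dict(zip(XS, (0.35, 0.35, 0.15, 0.15))),
       "w2": dict(zip(XS, (0.15, 0.15, 0.35, 0.35)))}

def mutual_information(joint):
    """Mutual information (in nats) of a joint table {(z, w): probability}."""
    pz, pw = defaultdict(float), defaultdict(float)
    for (z, w), p in joint.items():
        pz[z] += p
        pw[w] += p
    return sum(p * math.log(p / (pz[z] * pw[w]))
               for (z, w), p in joint.items() if p > 0)

def posteriors(cond, x):
    """A posteriori probabilities computed from the tables `cond` (cf. Eq. (16))."""
    weights = [cond[w][x] * PRIOR[w] for w in CLASSES]
    total = sum(weights)
    return tuple(round(v / total, 12) for v in weights)

def information_after(transform):
    """Information about the class kept by the features transform(x), under TRUE."""
    joint_y = defaultdict(float)
    for w in CLASSES:
        for x in XS:
            joint_y[(transform(x), w)] += TRUE[w][x] * PRIOR[w]
    return mutual_information(joint_y)

info_true = information_after(lambda x: posteriors(TRUE, x))
info_est = information_after(lambda x: posteriors(EST, x))
print(info_true, info_est)
```

Here the estimated transform merges input vectors whose true posteriors differ, so `info_est` comes out strictly smaller than `info_true`.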



Remark 4.1 It should be emphasized at this point that there is an important
difference between the present concept of information analysis and a practical
problem of pattern recognition. In a practical situation we would use the
estimated conditional distributions $P^{(i)}_{\mathcal{X}|\omega}$ to compute a posteriori probabilities
$p^{(i)}(\omega|x)$ and finally the decision $d^{(i)}(x) \in \Omega$ given an input observation $x \in \mathcal{X}$.
However, in case of the above information analysis, we use the estimated
distributions $P^{(i)}_{\mathcal{X}|\omega}$ only to define the related transform (feature extractor) $T^{(i)}$.
The resulting information inequality (23) compares the original "complete"
decision information $I(Q_{\mathcal{Y}}, p_\Omega)$ and the theoretically available information content
$I(Q^{(i)}_{\mathcal{Y}}, p_\Omega)$ of the new features $y^{(i)}$ (cf. (17)). Note that the true statistical
properties of the new features expressed by the distributions $Q^{(i)}_{\mathcal{Y}}, Q^{(i)}_{\mathcal{Y}|\omega}$ cannot
be deduced from the estimated conditional distributions $P^{(i)}_{\mathcal{X}|\omega}$. It would be
necessary to estimate them again from the training data. This remark applies in
an analogous way to all transforms considered in the following Sections.
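


4 Parallel Classifier Fusion

The inequality (23) suggests possible information loss which may occur in
practical solutions. One possibility to reduce this information loss is to fuse
multiple classifiers in parallel. In particular, considering multiple estimates
$P^{(i)}_{\mathcal{X}|\omega}$, $i \in \mathcal{I}$ of the unknown conditional distributions $P_{\mathcal{X}|\omega}$, $\omega \in \Omega$ we can use
the corresponding transforms $T^{(i)}$, $i \in \mathcal{I}$ (cf. (17)) simultaneously to define a
compound transform $\tilde{T}$:

$$\tilde{y} = \tilde{T}(x) = (T^{(1)}_1(x), \ldots, T^{(1)}_K(x), \ldots, T^{(I)}_1(x), \ldots, T^{(I)}_K(x)) \in \tilde{\mathcal{Y}}, \qquad (24)$$

$$y^{(i)}_k = T^{(i)}_k(x) = \varphi(p^{(i)}(\omega_k|x)), \quad x \in \mathcal{X}, \ k = 1, \ldots, K, \ i \in \mathcal{I}. \qquad (25)$$

In this sense the joint transform $\tilde{T}$ can be viewed as a parallel fusion of the
transforms $T^{(1)}, \ldots, T^{(I)}$. Again, the transform $\tilde{T}$ induces the corresponding
partition of the input space $\mathcal{X}$:

$$\tilde{\mathcal{S}} = \{\tilde{S}_{\tilde{y}}, \ \tilde{y} \in \tilde{\mathcal{Y}}\}, \quad \tilde{S}_{\tilde{y}} = \{x \in \mathcal{X} : \tilde{T}(x) = \tilde{y}\} \qquad (26)$$

and generates the transformed distributions $\tilde{Q}_{\mathcal{Y}}, \tilde{Q}_{\mathcal{Y}|\omega}$ on $\tilde{\mathcal{Y}}$:

$$\tilde{Q}_{\mathcal{Y}}(\tilde{y}) = P_{\mathcal{X}}(\tilde{S}_{\tilde{y}}), \quad \tilde{Q}_{\mathcal{Y}|\omega}(\tilde{y}|\omega) = P_{\mathcal{X}|\omega}(\tilde{S}_{\tilde{y}}|\omega), \quad \tilde{y} \in \tilde{\mathcal{Y}}, \ \omega \in \Omega. \qquad (27)$$

It is easy to see (cf. (18)) that the partition $\tilde{\mathcal{S}}$ of $\mathcal{X}$ can be obtained by intersecting
the sets of the partitions $\mathcal{S}^{(i)}$:

$$\tilde{S}_{\tilde{y}} = \{x \in \mathcal{X} : T^{(i)}(x) = y^{(i)}, \ i \in \mathcal{I}\} = \bigcap_{i \in \mathcal{I}} S_{y^{(i)}} \qquad (28)$$

and therefore the partition $\tilde{\mathcal{S}}$ is a refinement of any of the partitions $\mathcal{S}^{(i)}$, $i \in \mathcal{I}$.
Now we prove the following simple Lemma: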

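Lemma 4.1 Let $\tilde{Q}_{\mathcal{Y}}, \tilde{Q}_{\mathcal{Y}|\omega}$ be discrete probability distributions defined by the
partition $\tilde{\mathcal{S}}$ (cf. (26), (27)) and $\hat{Q}_{\mathcal{Y}}, \hat{Q}_{\mathcal{Y}|\omega}$ discrete probability distributions defined
by the partition $\hat{\mathcal{S}}$:

$$\hat{Q}_{\mathcal{Y}}(\hat{y}) = P_{\mathcal{X}}(\hat{S}_{\hat{y}}), \quad \hat{Q}_{\mathcal{Y}|\omega}(\hat{y}|\omega) = P_{\mathcal{X}|\omega}(\hat{S}_{\hat{y}}|\omega), \quad \hat{y} \in \hat{\mathcal{Y}}, \ \omega \in \Omega. \qquad (29)$$

Further let $\tilde{\mathcal{S}}$ be a refinement of the partition $\hat{\mathcal{S}}$. Then the statistical decision
information about $p_\Omega$ contained in $\tilde{Q}_{\mathcal{Y}}$ is greater than or equal to that contained in
$\hat{Q}_{\mathcal{Y}}$, i.e. we can write the inequality

$$I(\tilde{Q}_{\mathcal{Y}}, p_\Omega) \geq I(\hat{Q}_{\mathcal{Y}}, p_\Omega). \qquad (30)$$

Proof. We use the notation (cf. (8), (27) and (29))

$$p(\omega|\tilde{y}) = \frac{P(\tilde{S}_{\tilde{y}}|\omega)\,p(\omega)}{P(\tilde{S}_{\tilde{y}})}, \quad p(\omega|\hat{y}) = \frac{P(\hat{S}_{\hat{y}}|\omega)\,p(\omega)}{P(\hat{S}_{\hat{y}})} \qquad (31)$$

and recall that for any two subsets $\tilde{S}_{\tilde{y}} \in \tilde{\mathcal{S}}$, $\hat{S}_{\hat{y}} \in \hat{\mathcal{S}}$ it holds that either $\tilde{S}_{\tilde{y}} \subset \hat{S}_{\hat{y}}$
or their intersection is empty: $\tilde{S}_{\tilde{y}} \cap \hat{S}_{\hat{y}} = \emptyset$. It follows that we can write for any
$\hat{y} \in \hat{\mathcal{Y}}$

$$p(\omega|\hat{y}) = \sum_{\tilde{y} \in \tilde{\mathcal{Y}}} \frac{P(\tilde{S}_{\tilde{y}} \cap \hat{S}_{\hat{y}})}{P(\hat{S}_{\hat{y}})} \cdot \frac{P(\tilde{S}_{\tilde{y}}|\omega)\,p(\omega)}{P(\tilde{S}_{\tilde{y}})} = \sum_{\tilde{y} \in \tilde{\mathcal{Y}}} \frac{P(\tilde{S}_{\tilde{y}} \cap \hat{S}_{\hat{y}})}{P(\hat{S}_{\hat{y}})}\, p(\omega|\tilde{y}). \qquad (32)$$

Applying Jensen's inequality to the function $-\xi \log \xi$ and considering (32) we
obtain

$$-p(\omega|\hat{y}) \log p(\omega|\hat{y}) \geq \sum_{\tilde{y} \in \tilde{\mathcal{Y}}} \frac{P(\tilde{S}_{\tilde{y}} \cap \hat{S}_{\hat{y}})}{P(\hat{S}_{\hat{y}})} \left[-p(\omega|\tilde{y}) \log p(\omega|\tilde{y})\right]. \qquad (33)$$

Further, multiplying the inequality (33) by $P(\hat{S}_{\hat{y}})$ and summing through $\omega \in \Omega$
and $\hat{y} \in \hat{\mathcal{Y}}$, we obtain the following inequality for conditional entropies

$$H(p_\Omega|\hat{Q}_{\mathcal{Y}}) = \sum_{\hat{y} \in \hat{\mathcal{Y}}} \hat{Q}(\hat{y})\,H(p_\Omega|\hat{y}) \geq \sum_{\tilde{y} \in \tilde{\mathcal{Y}}} \tilde{Q}(\tilde{y})\,H(p_\Omega|\tilde{y}) \qquad (34)$$

which implies the inequality (30). ∎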



Consequently, since the partition $\tilde{\mathcal{S}}$ is a refinement of any of the partitions
$\mathcal{S}^{(i)}$, $i \in \mathcal{I}$, Lemma 4.1 implies the following information inequality

$$I(\tilde{Q}_{\mathcal{Y}}, p_\Omega) \geq I(Q^{(i)}_{\mathcal{Y}}, p_\Omega), \quad i \in \mathcal{I}. \qquad (35)$$

We can conclude that, expectedly, the classifier fusion represented by the
compound transform $\tilde{T}$ preserves more decision information than any of the
component transforms $T^{(i)}$. Let us remark, however, that the dimension of the feature
space $\tilde{\mathcal{Y}}$ produced by $\tilde{T}$ is $I$-times higher than those of $\mathcal{Y}^{(i)}$. This computational
aspect of the considered form of classifier fusion will be discussed later in Sec. 6.
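
Inequality (35) can be checked on a small example in which two crude classifiers are fused in parallel. The Python sketch below rests on hypothetical choices: the true tables `TRUE`, an estimate `EST_1` that depends only on the first bit of x, an estimate `EST_2` that depends only on the second bit, and the identity as the bijection in Eq. (25). The fused feature vector is simply the concatenation of the two posterior vectors, as in Eq. (24).

```python
import math
from collections import defaultdict

CLASSES = ("w1", "w2")
PRIOR = {"w1": 0.5, "w2": 0.5}
XS = [(0, 0), (0, 1), (1, 0), (1, 1)]
# Hypothetical true distributions and two imprecise estimates.
TRUE = {"w1": dict(zip(XS, (0.4, 0.3, 0.2, 0.1))),
        "w2": dict(zip(XS, (0.1, 0.2, 0.3, 0.4)))}
EST_1 = {"w1": dict(zip(XS, (0.35, 0.35, 0.15, 0.15))),   # uses first bit only
         "w2": dict(zip(XS, (0.15, 0.15, 0.35, 0.35)))}
EST_2 = {"w1": dict(zip(XS, (0.35, 0.15, 0.35, 0.15))),   # uses second bit only
         "w2": dict(zip(XS, (0.15, 0.35, 0.15, 0.35)))}

def mutual_information(joint):
    """Mutual information (in nats) of a joint table {(z, w): probability}."""
    pz, pw = defaultdict(float), defaultdict(float)
    for (z, w), p in joint.items():
        pz[z] += p
        pw[w] += p
    return sum(p * math.log(p / (pz[z] * pw[w]))
               for (z, w), p in joint.items() if p > 0)

def posteriors(cond, x):
    """A posteriori probabilities computed from the tables `cond`."""
    weights = [cond[w][x] * PRIOR[w] for w in CLASSES]
    total = sum(weights)
    return tuple(round(v / total, 12) for v in weights)

def information_after(transform):
    """Information about the class kept by transform(x), evaluated under TRUE."""
    joint_y = defaultdict(float)
    for w in CLASSES:
        for x in XS:
            joint_y[(transform(x), w)] += TRUE[w][x] * PRIOR[w]
    return mutual_information(joint_y)

info_1 = information_after(lambda x: posteriors(EST_1, x))
info_2 = information_after(lambda x: posteriors(EST_2, x))
# Parallel fusion: concatenate the output vectors of both classifiers.
info_fused = information_after(lambda x: posteriors(EST_1, x) + posteriors(EST_2, x))
print(info_1, info_2, info_fused)
```

Each estimate alone splits the input space into two cells, whereas the concatenated features separate all four input vectors, so the fused information is at least as large as either component value.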



5 Combining Functions 

Now we return to the most widely used classifier combination scheme based on
simple combining functions or combining rules. In particular we assume that the
a posteriori distributions $p^{(i)}_{\Omega|x}$, $i \in \mathcal{I}$ computed by different statistical classifiers
are transformed to a single $K$-dimensional output vector $\hat{y}$ by means of some
combining functions. We denote by $\hat{T}$ the resulting transform

$$\hat{T}: \mathcal{X} \to \hat{\mathcal{Y}}, \quad \hat{\mathcal{Y}} \subset R^K, \quad \hat{y} = \hat{T}(x) = (\hat{T}_1(x), \ldots, \hat{T}_K(x)) \in \hat{\mathcal{Y}},$$

$$\hat{y}_k = \hat{T}_k(x) = \Phi_k(p^{(1)}_{\Omega|x}, \ldots, p^{(I)}_{\Omega|x}), \quad x \in \mathcal{X}, \ k = 1, \ldots, K \qquad (36)$$

whereby $\Phi_k$ are arbitrary real-valued mappings which uniquely define the output
variables $\hat{y}_k$ as a function of the a posteriori distributions $p^{(i)}_{\Omega|x}$, $i \in \mathcal{I}$. Let us note
that, in this way, we can express in a unified manner various combining rules
for a posteriori distributions like average, median, product, weighted average
and others. More generally, the mapping $\Phi_k$ may be described e.g. by a simple
procedure and, in this way, we can formally describe different voting schemes
like majority voting, weighted voting, etc.

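If we define the partition of $\mathcal{X}$ induced by the transform $\hat{T}$

$$\hat{\mathcal{S}} = \{\hat{S}_{\hat{y}}, \ \hat{y} \in \hat{\mathcal{Y}}\}, \quad \hat{S}_{\hat{y}} = \{x \in \mathcal{X} : \hat{T}(x) = \hat{y}\} \qquad (37)$$

we can see that, for any particular type of the mappings $\Phi_k$, the partition $\tilde{\mathcal{S}}$ of
Sec. 4 is a refinement of the partition $\hat{\mathcal{S}}$. To verify this property of $\tilde{\mathcal{S}}$ we recall
that the a posteriori distributions $p^{(i)}_{\Omega|x}$ are identical for any $x \in C$, $C \in \tilde{\mathcal{S}}$
(cf. (25), (26)). Consequently, in view of the definition (36), we obtain identical
vectors $\hat{y} = \hat{T}(x)$ for all $x \in C$ and therefore $C \subset \hat{S}_{\hat{y}}$. In other words, for each
subset $C \in \tilde{\mathcal{S}}$ there is a subset $\hat{S}_{\hat{y}} \in \hat{\mathcal{S}}$ such that $C \subset \hat{S}_{\hat{y}}$, i.e. the partition
$\tilde{\mathcal{S}}$ is a refinement of the partition $\hat{\mathcal{S}}$. If we denote $\hat{Q}_{\mathcal{Y}}, \hat{Q}_{\mathcal{Y}|\omega}$ the transformed
distributions on $\hat{\mathcal{Y}}$ defined by the partition $\hat{\mathcal{S}}$:

$$\hat{Q}_{\mathcal{Y}}(\hat{y}) = P_{\mathcal{X}}(\hat{S}_{\hat{y}}), \quad \hat{Q}_{\mathcal{Y}|\omega}(\hat{y}|\omega) = P_{\mathcal{X}|\omega}(\hat{S}_{\hat{y}}|\omega), \quad \hat{y} \in \hat{\mathcal{Y}}, \ \omega \in \Omega \qquad (38)$$

then we can write the information inequality (cf. Lemma 4.1):

$$I(\tilde{Q}_{\mathcal{Y}}, p_\Omega) \geq I(\hat{Q}_{\mathcal{Y}}, p_\Omega). \qquad (39)$$

We can conclude that, expectedly, the classifier fusion represented by the
compound transform $\tilde{T}$ preserves more decision information than any transform of
the type $\hat{T}$ based on the combining functions (36).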
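
Inequality (39) can be illustrated by comparing parallel fusion with one specific combining function. The Python sketch below uses the average rule as the combining function; the true tables `TRUE`, the two single-bit estimates `EST_1` and `EST_2`, and the helper names are hypothetical and chosen only so that averaging merges two input vectors that the concatenated features keep apart.

```python
import math
from collections import defaultdict

CLASSES = ("w1", "w2")
PRIOR = {"w1": 0.5, "w2": 0.5}
XS = [(0, 0), (0, 1), (1, 0), (1, 1)]
# Hypothetical true distributions and two imprecise estimates.
TRUE = {"w1": dict(zip(XS, (0.4, 0.3, 0.2, 0.1))),
        "w2": dict(zip(XS, (0.1, 0.2, 0.3, 0.4)))}
EST_1 = {"w1": dict(zip(XS, (0.35, 0.35, 0.15, 0.15))),   # uses first bit only
         "w2": dict(zip(XS, (0.15, 0.15, 0.35, 0.35)))}
EST_2 = {"w1": dict(zip(XS, (0.35, 0.15, 0.35, 0.15))),   # uses second bit only
         "w2": dict(zip(XS, (0.15, 0.35, 0.15, 0.35)))}

def mutual_information(joint):
    """Mutual information (in nats) of a joint table {(z, w): probability}."""
    pz, pw = defaultdict(float), defaultdict(float)
    for (z, w), p in joint.items():
        pz[z] += p
        pw[w] += p
    return sum(p * math.log(p / (pz[z] * pw[w]))
               for (z, w), p in joint.items() if p > 0)

def posteriors(cond, x):
    """A posteriori probabilities computed from the tables `cond`."""
    weights = [cond[w][x] * PRIOR[w] for w in CLASSES]
    total = sum(weights)
    return tuple(round(v / total, 12) for v in weights)

def fused(x):
    """Parallel fusion: concatenation of both posterior vectors."""
    return posteriors(EST_1, x) + posteriors(EST_2, x)

def averaged(x):
    """Average rule: one K-dimensional vector of mean posteriors."""
    pairs = zip(posteriors(EST_1, x), posteriors(EST_2, x))
    return tuple(round((a + b) / 2, 12) for a, b in pairs)

def information_after(transform):
    """Information about the class kept by transform(x), evaluated under TRUE."""
    joint_y = defaultdict(float)
    for w in CLASSES:
        for x in XS:
            joint_y[(transform(x), w)] += TRUE[w][x] * PRIOR[w]
    return mutual_information(joint_y)

info_fused = information_after(fused)
info_avg = information_after(averaged)
print(info_fused, info_avg)
```

With these tables the average rule assigns the same output (0.5, 0.5) to the inputs (0, 1) and (1, 0), whose true posteriors differ, so part of the decision information available to the fused features is lost.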

6 Discussion 

Summarizing the inequalities derived in the above Sections we recall that the
transform $T(x)$ (cf. (3), (4)) based on the true probability distributions $P_{\mathcal{X}|\omega}$ is
information preserving in the sense of Eq. (11). Section 3 describes a practical
situation when only some estimates $P^{(i)}_{\mathcal{X}|\omega}$ of the true conditional distributions
are available. We have shown that the transform $T^{(i)}$ defined by means of the
estimated distributions $P^{(i)}_{\mathcal{X}|\omega}$ may be accompanied by some information loss,
as expressed by the inequalities (cf. (23))

$$I(Q_{\mathcal{Y}}, p_\Omega) \geq I(Q^{(i)}_{\mathcal{Y}}, p_\Omega), \quad i \in \mathcal{I}. \qquad (40)$$

The potential information loss (40) can be partly avoided by combining
classifiers. In particular, by parallel fusion of the transforms $T^{(i)}$, $i \in \mathcal{I}$, we obtain
the compound transform $\tilde{T}$ and the corresponding transformed distribution $\tilde{Q}_{\mathcal{Y}}$
satisfies the inequality (35). Consequently, as the general inequality (23) can be
proved for the distribution $\tilde{Q}_{\mathcal{Y}}$ without any change, we can write

$$I(Q_{\mathcal{Y}}, p_\Omega) \geq I(\tilde{Q}_{\mathcal{Y}}, p_\Omega) \geq I(Q^{(i)}_{\mathcal{Y}}, p_\Omega), \quad i \in \mathcal{I}. \qquad (41)$$

In other words, the compound transform $\tilde{T}$ preserves more decision information
than any of the component transforms $T^{(i)}$, but the dimension of the feature
space $\tilde{\mathcal{Y}}$ produced by $\tilde{T}$ is $I$-times higher than the dimension of each of the
subspaces $\mathcal{Y}^{(i)}$.

In view of the inequality (cf. (39))

$$I(\tilde{Q}_{\mathcal{Y}}, p_\Omega) \geq I(\hat{Q}_{\mathcal{Y}}, p_\Omega) \qquad (42)$$

the parallel classifier fusion preserves more decision information than various
methods based on combining functions. However, it should be emphasized that
the inequality (42) compares the decision information theoretically available by
means of parallel classifier fusion and by using combining functions respectively.
Moreover, the information advantage of parallel fusion is achieved at the
expense of the lost simplicity of the combining rules. In order to exploit the
available decision information it would be necessary to design a new classifier in the
high-dimensional feature space $\tilde{\mathcal{Y}}$. On the other hand, in the feature space $\hat{\mathcal{Y}}$ of
the combined classifier a large portion of the decision information may be lost
irreversibly.

Let us remark finally that inequalities analogous to (41) can be obtained in
connection with probabilistic neural networks (PNN) when the estimated
conditional distributions $P^{(i)}_{\mathcal{X}|\omega}$ have the form of distribution mixtures [?]. For each
PNN$^{(i)}$ we have a transform defined in terms of component distributions [?]
which preserves more decision information about $p_\Omega$ than the corresponding
transform $T^{(i)}$. Again, the underlying information loss connected with the
individual neural networks can be reduced by means of parallel fusion. It can be
shown that, theoretically, the parallel fusion of PNN is potentially more efficient
than the classifier fusion of Section 4.

7 Conclusion 

For the sake of information analysis we consider a general scheme of parallel 
classifier combinations in the framework of statistical pattern recognition. For- 
mally each classifier defines a set of output variables (features) in terms of the 
estimated a posteriori probabilities. The extracted features are used in parallel 
to define a new higher-level decision problem. By means of relatively simple facts 
we derived a hierarchy between different schemes of classifier fusion in terms of 
information inequalities. In particular, we have shown that the parallel fusion of 
classifiers is potentially more efficient than the frequently used techniques based 
on simple combining rules or functions. However the potential advantages are 
achieved at the expense of the lost simplicity of the combining functions and of 
the increased dimension of the new feature space. Thus, unlike combining
functions, the most informative parallel combining schemes would require designing
a new classifier in a high-dimensional feature space.



Abstract. The aim of this paper is to propose a simple procedure that
a priori determines a minimum number of classifiers to combine in order
to obtain a prediction accuracy level similar to the one obtained with the
combination of larger ensembles. The procedure is based on the McNe- 
mar non-parametric test of significance. Knowing a priori the minimum 
size of the classifier ensemble giving the best prediction accuracy, con- 
stitutes a gain for time and memory costs especially for huge data bases 
and real-time applications. Here we applied this procedure to four multiple
classifier systems with C4.5 decision tree (Breiman's Bagging, Ho's
Random subspaces, their combination we labeled 'Bagfs', and Breiman's
Random forests) and five large benchmark data bases. It is worth noticing
that the proposed procedure may easily be extended to other base
learning algorithms than a decision tree as well. The experimental results 
showed that it is possible to limit significantly the number of trees. We 
also showed that the minimum number of trees required for obtaining 
the best prediction accuracy may vary from one classifier combination 
method to another. 



1 Introduction 



Many methods have been proposed for combining multiple decision trees to
improve prediction accuracy [4,6,8,11,12,18,22]. These classifiers are weakened
so that they commit errors in different ways and their combination can correct the
mistakes an individual makes [?]. The main experimental studies quoted
above applied systematic methods to combine hundreds of classifiers and thus
did not limit a priori the number of trees to combine.

As far as we know, optimizing the number of classifiers to combine is an
open question in the literature about the improvements of MCS design. This
number has to be large enough to create diversity among the predictions, but
there may exist a number beyond which the prediction accuracy remains the same or
even decreases with respect to a given criterion. Giacinto and Roli [?] proposed
to select among a large set of classifiers an optimal subset of both diverse and
accurate classifiers of different types (neural and statistical classifiers).

This approach combines both a systematic design and an ‘overproduce-and- 
choose’ strategy which is a problem simpler than generating accurate and diverse 
classifiers ‘directly’. Here we propose a simple procedure based on a direct non- 
parametric test of comparison, the McNemar test. The procedure systematically 
determines a minimum number of weakened classifiers to combine for a given 
data base. It does not require the overproduction of classifiers and does not
select better classifiers than others with respect to a given criterion, as
proposed in Giacinto and Roli's approach. That is, once the procedure has
been applied, it may be possible to improve the MCS design further with other
post-treatments based, for instance, on the selection of 'good' classifiers.

Nevertheless, to assess the performance of the proposed procedure, we built
a large number of weakened decision trees to show that it may not be required
to grow random forests to significantly improve prediction accuracy. We applied
the procedure to four multiple classifier systems based on the C4.5 decision tree:
Breiman's Bagging [?], Ho's Random Subspaces [?], their combination in a
single model labeled 'Bagfs' [?] and Breiman's Random forests [?]. We assessed
the procedure's performance on five large benchmark databases. Indeed, the
proposed procedure based on the McNemar test is practically useful for huge data
bases or real-time applications, for which it has already been successfully applied.
It reduces memory and time requirements, which may be strong
criteria for the real-world application of MCSs. The experimental results showed
that the use of the McNemar test makes it possible to limit the number of trees
for each method significantly. We also observed that the minimal number of trees
required for maximum accuracy may vary, so that a good trade-off between
prediction accuracy and tree requirements of an MCS may be found.

The paper is organized as follows. The random forests are described in
Section 2. Then the McNemar test of significance and the procedure for limiting the
numbers of classifiers are explained in Section 3. The data bases to which the
multiple classifier systems are applied are detailed in Section 4 and the
experimental framework in Section 5. We discuss the results in Section 6 before the
conclusion (Section 7) and the references.



2 Random Forests 

To illustrate our idea of limiting the number of classifiers, we selected four ways
of building weakened decision forests: (1) bootstrap aggregating ('Bagging',
[?]), (2) Random subspace method (or 'MFS' for Multiple Feature Subsets,
[?]), (3) the combination of Bagging with Random subspace ('Bagfs', [?]) and
(4) Random Forest ('Bagrf', [?]).

Bagging consists of building B bootstrap replicates of an original data set 
and of using these to run a learning algorithm. Ross Quinlan m has validated 
the Bagging method with C4.5 decision tree inducer. 

The Random subspace method consists of training a given number of
classifiers (B), with each having as its input a given proportion of features (k) picked
randomly from the original set of f features with or without replacement. Ho
[?] proposed this approach for decision trees. Bay [?] applied a very similar
approach, labeled 'MFS', to nearest neighbors. This method was performed here
by using the original feature set only (i.e. without expanding the feature vector
with combination functions of features) and by selecting randomly a proportion
of features without replacement. In the rest of the paper, we will refer to this
weakening method by the label 'MFS'.

We showed on benchmark data bases that combining Bagging and MFS in the same
architecture (‘Bagfs’) could improve prediction accuracy. In that earlier work,
the ‘Bagfs’ architecture had two levels of decision (a ‘nested’ level for each
bootstrap between all its MFS and a ‘final’ level between all bootstraps). Here,
we applied a simpler architecture with only one level of decision.
We generated B bootstrap replicates of the learning set (the same ones used
to apply the Bagging method). In each replicate we independently sampled a
subset of f' features, randomly selected from amongst the f initial ones without
replacement (the same ones used to apply ‘MFS’). We denoted k = f'/f as the
proportion of features in these B subsets. The proposed architecture thus has
two parameters, B and k, to be set.
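The sampling step of this single-level architecture can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `bagfs_samples`, the `seed` argument and the rounding of k * f to an integer subset size are all hypothetical choices.

```python
import random


def bagfs_samples(n_examples, n_features, n_classifiers, k, seed=0):
    """Draw one (bootstrap rows, feature subset) pair per classifier.

    k is the proportion of features kept in each subset (k = f'/f).
    Rounding k * n_features to an integer is an assumption of this sketch.
    """
    rng = random.Random(seed)
    n_sub = max(1, round(k * n_features))
    samples = []
    for _ in range(n_classifiers):
        # Bootstrap replicate: n_examples row indices drawn WITH replacement.
        rows = [rng.randrange(n_examples) for _ in range(n_examples)]
        # Feature subset: n_sub distinct columns, i.e. WITHOUT replacement.
        cols = sorted(rng.sample(range(n_features), n_sub))
        samples.append((rows, cols))
    return samples
```

Each pair would then be used to train one decision tree on the selected rows and columns; keeping only the rows gives Bagging, keeping only the columns gives MFS.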

The proportion of features in each subspace, denoted k_opt in Table 1, of
MFS and Bagfs was optimized by performing a nested stratified 10-fold
cross-validation. It is worth noticing that we obtained the same k_opt for both
MFS and Bagfs.

Breiman's Random forest method (which we labeled ‘Bagrf’) consists of creating
B bootstrap replicates of the learning set. For each replicate, a feature subset
to split on is randomly selected (without replacement) at each node of the tree.
According to Breiman's method, we fixed the size of these random subsets,
denoted F in Table 1, to be the first integer less than log_2(f) + 1, where f is
the number of features.

A common feature of all methods is that they combine predictions by means 
of the plurality vote. Moreover, Bagging, MFS and Bagfs can be applied to any 
learning algorithms that are unstable for training modifications (e.g. decision 
trees, artificial neural networks) and feature set modification (e.g. decision trees, 
nearest neighbours) while Bagrf is specific to decision trees. 
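The plurality vote shared by all four methods can be sketched as below. The function name and the tie-breaking rule (smallest label wins) are assumptions of this sketch; the text does not say how ties are resolved.

```python
from collections import Counter


def plurality_vote(predictions):
    """Combine per-classifier label lists into one voted label per example.

    predictions[c][e] is the label given by classifier c to example e.
    Ties are broken by taking the smallest label (an assumption).
    """
    voted = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        best = max(counts.values())
        voted.append(min(label for label, c in counts.items() if c == best))
    return voted
```

For example, three classifiers voting `[0, 1, 2]`, `[0, 1, 1]` and `[1, 1, 2]` on three examples yield the combined prediction `[0, 1, 2]`.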

We tested each method with respect to Ross Quinlan's C4.5 decision tree
Release 8 with its default parameter values and its pruning method (all
the decision trees were pruned except for Bagrf, as specified in its original
formulation).

3 McNemar Test of Significance



3.1 General Background 



In this paper, we use the McNemar test as a direct method for testing
whether two sets of predictions differ significantly. Given two algorithms
A and B, this test compares the number of examples misclassified by A but not
by B (labeled $M_{ab}$) with the number of examples misclassified by B but not
by A (labeled $M_{ba}$). If $M_{ab} + M_{ba} > 20$ and the null hypothesis $H_0$
is true (i.e., there is no difference between the algorithms' predictions), then
the statistic $\chi^2$ of Equation (1) can be considered as following a $\chi^2$
distribution with 1 degree of freedom.

$$\chi^2 = \frac{(|M_{ab} - M_{ba}| - 1)^2}{M_{ab} + M_{ba}} \sim \chi^2_{1,0.95} \qquad (1)$$



The hypothesis $H_0$ is rejected if $\chi^2$ is greater than
$\chi^2_{1,0.95} = 3.841459$ (significance level p < 0.05). In this case, the
algorithms have significantly different levels of performance. If the condition
$M_{ab} + M_{ba} > 20$ is not satisfied, the approximation of the statistical
distribution cannot be used and the exact test has to be performed. As this
happened rarely in our experimental design, in these cases we preferred to
accept the hypothesis that the two algorithms have the same performance.
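The test as described above can be sketched in a few lines. The function name `mcnemar` and the choice to return `None` when the approximation does not apply are assumptions of this sketch; the continuity-corrected statistic and the threshold 3.841459 come from the text.

```python
CHI2_1_095 = 3.841459  # critical value quoted in the text (1 d.o.f., p < 0.05)


def mcnemar(y_true, pred_a, pred_b, min_discordant=20):
    """Return (M_ab, M_ba, statistic) for two prediction lists.

    M_ab counts examples misclassified by A but not by B, M_ba the reverse.
    The statistic is None when M_ab + M_ba <= min_discordant, i.e. when the
    chi-square approximation should not be used.
    """
    m_ab = sum(1 for y, a, b in zip(y_true, pred_a, pred_b) if a != y and b == y)
    m_ba = sum(1 for y, a, b in zip(y_true, pred_a, pred_b) if a == y and b != y)
    if m_ab + m_ba <= min_discordant:
        return m_ab, m_ba, None
    stat = (abs(m_ab - m_ba) - 1) ** 2 / (m_ab + m_ba)
    return m_ab, m_ba, stat
```

The null hypothesis would be rejected when the returned statistic exceeds `CHI2_1_095`; a `None` statistic corresponds to the case where the text keeps the hypothesis of equal performance.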

Moreover, different studies showed that this non-parametric test is also
preferred to parametric ones (such as the commonly used t-test) because no
assumption is required and it is independent of any evaluation measurement
(error rate, kappa degree of agreement, ...). Dietterich [7] also
showed that McNemar has a low type I error (the probability of incorrectly
detecting a difference when no difference exists) and concluded that it is one of
the more acceptable tests among the most common ones if the algorithms can
only be executed once.



3.2 Limiting the Number of Classifiers 

When creating multiple classifier systems such as the random forests described in
Section 2, we may overproduce an arbitrarily large number, B, of voting
classifiers. In this paper, the question is how to limit the number of
classifiers to produce while being as accurate as the same MCS combining a
larger number of classifiers.

We applied the McNemar test of significance as described in Section 3.1 between
two sets of predictions from two MCSs that differ only by their number of
classifiers. Let us denote $\mathcal{L}$ a learning set and $T = \{(x, y)\}$ a
data set independent from $\mathcal{L}$. Let
$C_m = \{y = \mathrm{vote}\{\varphi^{(k)}(x, \mathcal{L}),\ k = 1, \ldots, m\}\}$
be the prediction set of m voting classifiers. The classifiers are built so that
the classifier predictions are diverse and on an equal footing in terms of
voting, i.e. no classifier is a priori better than another with respect to any
criterion (e.g. as is the case here by building multiple random decision trees,
see Section 2).

The proposed procedure consists of comparing the prediction set $C_m$ to $C_n$,
with n > m, with respect to the McNemar test. Either the set of classifiers used
to obtain the predictions $C_n$ is completely independent from the one that
predicts $C_m$, or it contains all or part of the m classifiers that predict
$C_m$. We showed that this does not change our conclusion, as will be detailed
in Section 6. The McNemar test gives an answer d with a significance level
p < 0.05:

$$d(m, n) = \begin{cases} 1 & \text{if } H_0 \text{ is rejected and } M_{mn} > M_{nm} \\ -1 & \text{if } H_0 \text{ is rejected and } M_{nm} > M_{mn} \\ 0 & \text{if the two prediction sets do not differ} \end{cases}$$

where $M_{mn}$ is the number of examples misclassified by the m-classifier
system but not by the n-classifier system.

If d(m, n) = 1, then we conclude that combining n classifiers gives a higher
level of performance than combining m classifiers with respect to the McNemar
test. So we should carry on the procedure with a higher number of classifiers
than n.

If d(m, n) = -1, we should stop the procedure and use m classifiers only. This
case should appear only rarely, since increasing the number of voting
classifiers should not degrade the prediction accuracy significantly. As a
matter of fact, this case never appeared in our experiments.

If d(m, n) = 0, then combining m weakened classifiers does not significantly
differ from combining n classifiers and we may keep
$\{\varphi^{(k)}(x, \mathcal{L}),\ k = 1, \ldots, m\}$ as the multiple
classifier system with the minimal number of classifiers, m* = m,
that limits the number of classifiers to combine.
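The decision d(m, n) and one possible way of reading m* off a set of voted predictions can be sketched as follows. The extraction rule used here (the smallest m that does not differ significantly from any larger ensemble) is an assumption of this sketch, as are the function names and the dictionary layout.

```python
def d(y_true, pred_m, pred_n, threshold=3.841459, min_discordant=20):
    """Return 1, -1 or 0 following the three cases of d(m, n)."""
    m_mn = sum(1 for y, a, b in zip(y_true, pred_m, pred_n) if a != y and b == y)
    m_nm = sum(1 for y, a, b in zip(y_true, pred_m, pred_n) if a == y and b != y)
    total = m_mn + m_nm
    if total <= min_discordant:
        return 0  # approximation not usable: keep the null hypothesis
    stat = (abs(m_mn - m_nm) - 1) ** 2 / total
    if stat <= threshold:
        return 0
    return 1 if m_mn > m_nm else -1


def minimal_size(y_true, preds_by_size):
    """Smallest m whose predictions do not differ from any larger ensemble.

    preds_by_size maps an ensemble size to its voted predictions on the
    validation set (hypothetical layout).
    """
    sizes = sorted(preds_by_size)
    if not sizes:
        raise ValueError("no ensemble sizes given")
    for m in sizes:
        if all(d(y_true, preds_by_size[m], preds_by_size[n]) == 0
               for n in sizes if n > m):
            return m
```

The loop always returns, because the largest size trivially has no larger ensemble to differ from.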

4 Material 

We applied Bagging, MFS, Bagfs and Bagrf to 5 large data bases (see Table 1).
Four of these were downloaded from the UCI Machine Learning repository [3],
i.e. satimage, image segmentation (‘image’), letter and DNA. We also included
the artificial data base ‘ringnorm’ used by Breiman. None of these data bases
has missing values. Notice that for DNA, we gave Bagrf's parameter F a
higher value since the one obtained by the original computation (F = 7, see
Section 2) led to a low prediction accuracy.



Table 1. Databases used to perform the classification tasks.

Data set      Learning   #Feat.          #Classes   k_opt    F
              set size   cont./nominal
ringnorm *        7400   20/0                   2       8    5
satimage          6435   36/0                   6      18    6
image             2310   18/0                   7       7    5
DNA               3186   0/60                   3      36   20
letter *         20000   16/0                  26       7    5

* : databases where the examples are equi-distributed across the classes.

5 Experimental Design

In the present paper, we investigated the benefit of using the McNemar test of
significance as described in Section 3 to determine m*, the minimum number
of classifiers to combine for a given multiple classifier system on a given data
set. We illustrated this procedure on the four multiple classifier systems
described in Section 2, namely Bagging, Random subspaces (‘MFS’), Bagfs and
another Random forest, ‘Bagrf’, applied to five data bases (as detailed in
Section 4). For each of these MCSs, we overproduced B = 200 weakened decision
trees. We split each data base into 3 stratified folds: a learning set
$\mathcal{L}$, a validation set V and a testing set T. Evaluations and
comparisons of the MCSs were made on the basis of these 3 folds and we validated
our approach by permuting the role played by each fold. $\mathcal{L}$ is used as
the learning set to build 200 weakened decision trees. V is used to apply the
McNemar test between the prediction set resulting from the vote of m classifiers
(m = 1, ..., 200) and the prediction set of the vote of n classifiers (n > m).
Finally, we kept T for independently testing the predictions of the procedure.

Using the validation set V, we obtained the table $D_v = d_v(m, n)$, m, n =
1, ..., 200 (subscript v is used for results carried out on the basis of the
validation set V). Once $D_v$ was computed, we extracted the recommended
$m^*_v$ as explained in Section 3.2. Then the remaining data set, T, was used to
determine D = d(m, n) and extract $m^*$ in order to assess the proposed
procedure on the independent testing set. For each classification task and each
MCS, we are then able to compare $m^*_v$ to $m^*$ and thus to appreciate the
quality of the predicted value $m^*_v$.

6 Results and Discussion 

In Figure 1, for each method and each data base, each dot represents d(m, n),
the result of the McNemar test that compares the prediction set of an
m-classifier system (on a row) with the prediction set of an n-classifier system
(on a column) (m, n = 1, ..., 200). Each figure is symmetrical and composed of a
bright and a dark region. The dark region means that the compared architectures
differ significantly with respect to McNemar. The bright region means that the
compared architectures do not differ significantly. These results showed that a
threshold appeared distinctly between the two regions, ‘differ’ and ‘do not
differ significantly’. So the proposed procedure based on the McNemar test led
to the determination of m*, a significantly lower number of classifiers on most
data bases than the total of 200 classifiers overproduced for each MCS.

Table 2 summarizes the results of the experimental design for each multiple
classifier system and each data base. This table indicates in brackets each m*
computed as detailed above. We also give the percentage of correct
classification obtained with each m* to compare the performance of each
MCS on the same data base.

Fig. 1. Experimental results. Influence of the number of trees on the predictions
with respect to McNemar test

The results obtained on the remaining independent testing set, $m^*$, showed
that this number was always close to the predicted value $m^*_v$ (most of the
time equal or even lower). In the too optimistic but rare cases where
$m^*_v < m^*$, we observed that the difference was never larger than 10.
Nevertheless, the results showed that by performing this procedure, we obtained
a drastic decrease of the number of classifiers required to obtain the same
level of performance as MCSs combining 200 classifiers. We also observed in
Table 2 that

- Bagfs systematically exhibited a better prediction accuracy and a lower m*
  than Bagrf with respect to the McNemar test.

- By increasing the number of classifiers, Bagfs always exhibited significantly
  better performance than Bagging with respect to the McNemar test.

- To obtain the same level of accuracy as MFS, Bagfs required fewer classifiers
  on satimage and image. On the other data bases, Bagfs exhibited significantly
  better results than MFS (with respect to McNemar) but it required more
  classifiers.



Table 2. Experimental Results. Performance in terms of the prediction accuracy
(%), with the minimum recommended number of trees with respect to McNemar in
brackets.

Data set    C4.5   Bag         MFS         Bagfs        Bagrf
ringnorm    89.3   94.0 (10)   97.7 (30)   98.4 (50)    96.2 (60)
satimage    84.0   88.2 (20)   89.8 (70)   89.6 (50)    89.1 (50)
image       93.6   96.1 (10)   96.2 (50)   96.0 (10)    94.5 (40)
DNA         86.5   89.5 (20)   90.0 (20)   91.5 (30)    89.8 (130)
letter      81.4   88.6 (90)   90.4 (50)   91.6 (110)   89.0 (200)



In the present paper, we systematically overproduced classifiers (200) to assess
the method's performance. The results obtained on each data base with each
MCS suggest that we could instead incrementally increase the number of
classifiers in steps of 10 and perform the direct McNemar test at each step.
Furthermore, an MCS designed in this way would combine a limited number of
classifiers, m* (i.e. predicted on a reduced validation set independent
from the learning set), without any significant loss of accuracy when applied to
a large data set of unknown cases. This question is especially interesting for
huge data bases and real-time applications working on base learning algorithms
slower than decision trees (e.g. neural networks), where it yields a gain in
time and memory costs.
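The incremental variant suggested above might look like the following sketch. The stopping rule (stop as soon as adding one more step of classifiers is not significantly better), the callback `vote_with` and the default values are assumptions of this sketch rather than a procedure specified in the text.

```python
def grow_until_stable(y_val, vote_with, step=10, max_size=200,
                      threshold=3.841459, min_discordant=20):
    """Grow the ensemble by `step` classifiers until McNemar sees no gain.

    vote_with(m) is a hypothetical callback returning the voted predictions
    of the first m classifiers on the validation set.
    """
    m = step
    pred_m = vote_with(m)
    while m + step <= max_size:
        pred_n = vote_with(m + step)
        only_m = sum(1 for y, a, b in zip(y_val, pred_m, pred_n)
                     if a != y and b == y)
        only_n = sum(1 for y, a, b in zip(y_val, pred_m, pred_n)
                     if a == y and b != y)
        total = only_m + only_n
        significant = (total > min_discordant
                       and (abs(only_m - only_n) - 1) ** 2 / total > threshold)
        if not (significant and only_m > only_n):
            return m  # the larger ensemble is not significantly better
        m, pred_m = m + step, pred_n
    return m
```

Only the trees actually needed are built, which is where the time and memory gain would come from for slow base learners.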
7 Conclusion

We suggested a simple procedure based on the direct test of McNemar to limit
the number of classifiers to combine in a multiple classifier system. The
procedure compares the set of predictions of an MCS with a given number of
classifiers with the prediction set of the same MCS with a higher number of
classifiers. If the prediction sets do not differ with respect to the McNemar
test, we conclude that the smaller number of classifiers is enough to obtain
the same level of accuracy.

Experimental results on four different MCSs applied to C4.5 decision trees, with
cross-validation on five large benchmark data bases, showed that it may be
possible to select a priori a minimum number of classifiers which, once combined
with the plurality voting rule, offers the same level of performance as larger
numbers of trees with respect to the McNemar test. Moreover, we showed that
a sharp threshold appeared between the region where the prediction sets ‘differ’
and the one where the prediction sets ‘do not differ significantly’ with respect
to the McNemar test.

Furthermore, we suggested a way to improve the design of an MCS without
overproducing classifiers. It consists of incrementally adding new classifiers
to the existing ensemble and comparing, by means of a cross-validation, the
predictions of the resulting ensemble to those of the one with fewer
classifiers.

Finally, the simple approach proposed in this paper improves the design of
a multiple classifier system by limiting the number of classifiers to
combine without a loss of prediction accuracy (with respect to a direct
statistical test of comparison, McNemar) but with a gain in memory and time
costs that may be significant for huge data bases and real-time applications.
Abstract. In this paper we propose a neural network ensemble model, consisting
of a number of MLPs, that deals with an imperfect learning supervisor that
occasionally produces incorrect teacher signals. It is known that a conventional
unitary neural network will not learn optimally from this kind of supervisor.
We consider that the imperfect supervisor generates two kinds of input-output
relations, the correct relation and the incorrect one. The learning
characteristics of the proposed model allow the ensemble to automatically train
one of its members to learn only from the correct input-output relation,
producing a neural network that can to some extent tolerate the imperfection of
the supervisor.



1 Introduction 



In recent years a number of multi-net models have been introduced. One of
them is the Mixture of Experts (ME), which is a combination of a number of
Multi-layer Perceptrons (MLP) with a gating network. The objective of this model
is to decompose a given task, and allocate different experts to deal with
different sub-tasks. ME successfully solved problems that are difficult for
unitary neural networks. An ensemble model with a similar objective was also
proposed more recently. Unlike ME, there are also neural network ensemble models
that train their modules on the whole task and combine the outputs to achieve
better generalization. The methods of generating neural network ensembles and
their advantages compared to a unitary neural network have been explained
elsewhere. All of the multi-net models described above showed better performance
than conventional unitary neural networks, but they were designed based on the
assumption that the tasks/environments that have to be solved are stationary,
which means that the dynamics that regulate the input and output are fixed over
time. In real world problems, the neural network may have to deal with data
originating from different sources with different dynamics. The multi-net models
described above will fail to deal with this kind of "switching dynamics" problem
because it is impossible for them to map the same inputs to different outputs.
To deal with a switching dynamics problem, an ensemble model that can train each
of its members to recognize a particular dynamic through competition was
proposed. We have also proposed a neural network ensemble model with
a switching mechanism that allows simultaneous learning for all the ensemble's
members while automatically selecting one member that is considered to be the
most suitable for a particular input-output relation (environment) and
allocating different members to different environments. In this paper we applied
the proposed ensemble model in a situation where the neural network has to
learn from an imperfect supervisor that occasionally produces erroneous teacher
signals. An imperfect supervisor is a supervisor which stochastically produces
two kinds of input-output relations (dynamics), the correct input-output
relation and the erroneous one. We are dealing with a condition in which, during
the learning process, the supervisor adopts "the correct dynamics" most of the
time but occasionally switches its dynamics to erroneous ones. Because our
proposed ensemble can train each of its members with different input-output
relations, it can be expected that one of the ensemble's members will learn the
correct input-output relation while the erroneous one will be absorbed by
others, so that a neural network that is not contaminated by the incorrect data
produced by the imperfect supervisor can be generated.



2 Problem Specification 

Suppose we have a training set,

$$\Omega = \{(X(1), d(1)), \ldots, (X(n), d(n))\} \qquad (1)$$

where $X \in R^{N_{in}}$ is the input vector and $d \in \{0, 1\}^{N_{out}}$ is
the desired output vector. $N_{in}$ and $N_{out}$ are the dimensions of the
input and output vectors, respectively. In this paper we considered an imperfect
supervisor who occasionally gives an erroneous desired output with a certain
probability. The behavior can be written as

$$P(d^{true}(i) \mid X(i)) = 1 - \epsilon$$
$$P(d^{err}(i) \mid X(i)) = \epsilon \qquad (2)$$
$$0 < \epsilon < 1,$$

where $P(d \mid X)$ is the conditional probability that the supervisor produces
d as teacher signal for input X, and $\epsilon$ is the error rate. $d^{true}$
and $d^{err}$ are the correct and erroneous teacher signals, respectively.

In this paper we are dealing with problems that require binary outputs, so
the erroneous teacher signal is not caused by additive noise on the correct one,
but by misclassification by the supervisor. We focus our attention on
classification problems where the teacher signal is represented by a set of
bits, only one of which has the value of 1 to denote a certain class. The
erroneous teacher signal is a teacher signal that relates a given input to an
incorrect class.

Our objective is to train a neural network that can achieve

$$y(X(i)) \approx d^{true}(X(i)) \qquad (3)$$

where y(X(i)) denotes the output of the neural network given input X(i).
However, presented with the imperfect supervisor defined above, a conventional
unitary neural network will produce output according to Eq. (4), which does not
satisfy our objective.

$$y(X(i)) \approx E[d \mid X(i)]$$
$$E[d \mid X(i)] = (1 - \epsilon)\, d^{true}(X(i)) + \epsilon\, d^{err}(X(i)) \qquad (4)$$

where $E[d \mid X(i)]$ denotes the expectation of teacher signal d given input
X(i).

Fig. 1. Neural networks ensemble



3 Neural Networks Ensemble 

The structure of the proposed ensemble is shown in Fig. 1. The ensemble consists
of a number of independent multi-layer perceptrons (MLP), which we call members.
Each member has an identical number of input neurons and output neurons, but
because we want each member to specialize in different input-output relations,
they are diversified by setting a different number of middle neurons for each
of them. In the learning process, an input coming into the ensemble is directed
to the input layers of all members and is simultaneously and independently
processed by them. Each member will automatically decide whether to learn from
the training pattern or to discard it. The data selection through competition
between the members is governed by the temperature control mechanism explained
in Section 4.

The output of the neurons in the middle layer is described as follows.

$$y^{i,mid}_m = \frac{1}{1 + \exp(-u^{i,mid}_m)}, \qquad u^{i,mid}_m = \sum_{n=1}^{N_{in}} w^{i,mid}_{nm} x_n - \theta^{i,mid}_m \qquad (5)$$

where $y^{i,mid}_m$, $u^{i,mid}_m$ and $\theta^{i,mid}_m$ are the output,
potential and threshold value of the m-th middle neuron in the i-th member,
respectively, $w^{i,mid}_{nm}$ is the connection weight between the n-th neuron
in the input layer and the m-th neuron in the middle layer of the i-th member,
$x_n$ is the n-th element of the input vector, while $N_{in}$ is the dimension
of the input vector.

The output of the neuron in the output layer is described as follows.

$$y^{i,out}_k = \frac{1}{1 + \exp(-u^{i,out}_k / T_i)}, \qquad u^{i,out}_k = \sum_{m=1}^{N^i_{mid}} w^{i,out}_{mk} y^{i,mid}_m - \theta^{i,out}_k \qquad (6)$$

where $y^{i,out}_k$, $u^{i,out}_k$ and $\theta^{i,out}_k$ are the output,
potential and threshold value of the k-th output neuron in the i-th member,
respectively, $w^{i,out}_{mk}$ is the connection weight between the m-th neuron
in the middle layer and the k-th neuron in the output layer of the i-th member,
while $N^i_{mid}$ is the number of middle neurons in the i-th member. $T_i$ is
the temperature of the i-th member.

It is clear that when Ti is high, the output neurons in the i-th member will 
always produce responses close to 0.5 regardless of their potentials. Because for 
problems requiring binary answers, 0.5 is a value without any significance, a 
member with high temperature is defined as an inactive member. 
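The effect of the temperature on an output neuron can be checked numerically. The sketch below assumes a logistic activation whose potential is divided by the temperature, which is one form consistent with the behaviour described above; the function name is hypothetical and the two temperatures are the Tmin and Tmax values of the experiment.

```python
import math


def output_activation(potential, temperature):
    """Logistic output with the potential scaled by 1/temperature (assumed form)."""
    return 1.0 / (1.0 + math.exp(-potential / temperature))


# Same potential, lowest and highest temperature used in the experiment.
active = output_activation(5.0, 1.0)      # clearly above 0.5
inactive = output_activation(5.0, 200.0)  # squeezed towards 0.5
```

With a potential of 5, the low-temperature member answers close to 1, while the high-temperature member stays within 0.01 of the uninformative value 0.5.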

Adopting the backpropagation learning method for each member, the weight
correction between the m-th middle neuron and the k-th output neuron of the
i-th member can be written as

$$\Delta w^{i,out}_{mk} = \frac{1}{T_i}\, \delta^{i,out}_k\, y^{i,mid}_m \qquad (7)$$
$$\delta^{i,out}_k = (d_k - y^{i,out}_k)\, y^{i,out}_k\, (1 - y^{i,out}_k)$$

where $d_k$ is the k-th element of the teacher signal. From Eq. (7), if the
temperature $T_i$ is sufficiently large, then the weight correction will be
insignificant, and consequently the weight corrections between neurons in the
input layer and ones in the middle layers will be insignificant, because the
correction can be written as follows.
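Under standard backpropagation through a temperature-scaled output layer, this middle-layer correction would take the following form. This is a sketch: the symbol $\delta^{i,mid}_m$ and the index conventions are assumptions carried over from the output-layer rule, not notation confirmed by the text.

```latex
\Delta w^{i,mid}_{nm} = \delta^{i,mid}_m \, x_n, \qquad
\delta^{i,mid}_m = y^{i,mid}_m \,(1 - y^{i,mid}_m)\,
    \frac{1}{T_i} \sum_{k=1}^{N_{out}} \delta^{i,out}_k \, w^{i,out}_{mk}
```

Every term of the back-propagated error carries the factor $1/T_i$, so a large temperature suppresses the corrections in both layers at once.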
From Eqs. (7) and (8) we can see that if a member decides to discard a training
pattern, then it can increase its temperature so that the member will not be
influenced by the training pattern. This is the basic idea of how the ensemble
can produce a member that selects only the correct learning data, thus
minimizing the effect of an imperfect supervisor. The temperature control
mechanism controls the temperatures of all members based on their performance
for a given learning pattern.

The correction of the connection weight vector in the i-th member is expressed
as follows.

$$W^i(t + 1) = W^i(t) + \eta\, \Delta W^i(t) + \kappa\, \Delta W^i(t - 1) \qquad (9)$$

$W^i(t)$ and $\Delta W^i(t)$ are the weight vector and correction vector of the
i-th member at time t, respectively. $\eta$ and $\kappa$ are the learning rate
and momentum, respectively.

Because we focus on binary problems, the continuous output of each member
should be rounded by a stepping function defined as follows,

$$s(x) = \begin{cases} 1 & \text{for } x > 0.5 + \beta \\ 0.5 & \text{for } 0.5 - \beta \le x \le 0.5 + \beta \\ 0 & \text{for } x < 0.5 - \beta \end{cases} \qquad (10)$$

The value of 0.5 is adopted in the stepping function because we consider
values in the vicinity of 0.5 as ambiguous for binary problems; they can be
interpreted as "don't know" in classification problems.
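The stepping function translates directly into code. In this sketch the function name is hypothetical and the width `beta` is left as a parameter, since the text does not give its value.

```python
def step(x, beta):
    """Round a continuous output to 1, 0 or the ambiguous value 0.5."""
    if x > 0.5 + beta:
        return 1.0
    if x < 0.5 - beta:
        return 0.0
    return 0.5  # "don't know" band around 0.5
```

With an illustrative `beta` of 0.2, an output of 0.9 is rounded to 1, an output of 0.1 to 0, and an output of 0.6 falls in the "don't know" band.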



4 Temperature Control Mechanism 



The temperature control mechanism is introduced to enable the ensemble to
allocate a given training pattern to the most appropriate member. The basic
idea is to reward members that perform relatively well with respect to the given
training pattern by decreasing their temperatures, so that they may learn
further from the training pattern, and to punish members that perform badly by
increasing their temperatures. Adopting this idea, from a particular member's
point of view, that member is allowed to select training data that suit its
specialty. The performance of the i-th member is measured by its relative error
$r^i$, defined as

$$r^i = \frac{\sum_{k=1}^{N_{out}} (d_k - y^{i,out}_k)^2}{\sum_{j=1}^{M} \sum_{k=1}^{N_{out}} (d_k - y^{j,out}_k)^2} \qquad (11)$$

where $d_k$, $y^{i,out}_k$, M and $N_{out}$ are the k-th element of the teacher
signal, the response of the k-th neuron in the output layer of the i-th member,
the number of members in the ensemble and the number of output neurons, which is
shared by all the members, respectively. A small relative error shows the good
performance of the member, and a big relative error indicates the opposite.
Furthermore, it is favorable that a member with the lowest temperature dominates
the rest, so a member that performs well should also punish the others by
increasing their temperatures, and members with low performances should
surrender their rights to learn by decreasing others' temperatures. The
temperature correction is executed for every learning iteration according to



$$T_i(t + 1) = T_i(t) + \Delta T_i(t) - C \qquad (12)$$
$$\Delta T_i(t) = -A_s (1 - M r^i) + A_c \sum_{j \ne i}^{M} (1 - M r^j),$$

where $A_s$, $A_c$ and C are positive self-penalty, cross-penalty and cooldown
constants, respectively.

The first term in Eq. (12) is the self-penalty term, which will increase a
particular member's temperature if the performance of that member is below
average, and decrease it if the performance is better than average. The second
term is the cross-penalty term, which is the cumulative penalty or reward from
other members. When all the members have identical performances (no particular
winner), learning chances have to be given to all members. In this case, the
cooldown constant acts as a catalyst to prevent deadlock in the learning
competition.

We limit the temperature value between $T_{min}$ and $T_{max}$ according to

$$T_i(t + 1) = T_{max} \quad \text{if } T_i(t) + \Delta T_i(t) - C > T_{max} \qquad (13)$$
$$T_i(t + 1) = T_{min} \quad \text{if } T_i(t) + \Delta T_i(t) - C < T_{min}$$
$$0 < T_{min} < T_{max}$$
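One learning iteration of the temperature control can be sketched as follows. The update form (self-penalty, cross-penalty summed over the other members, cooldown, then clipping) follows the description in this section, but the exact expression is an assumption of this sketch, as are the function names and the handling of a zero total error. The default constants are the values listed in the parameter table of the experiment.

```python
def relative_errors(target, outputs):
    """Squared error of each member divided by the summed error of all members."""
    sq = [sum((d - y) ** 2 for d, y in zip(target, out)) for out in outputs]
    total = sum(sq)
    if total == 0:  # all members perfect: treat them as equal (assumption)
        return [1.0 / len(outputs)] * len(outputs)
    return [e / total for e in sq]


def update_temperatures(temps, rel_err, self_pen=100.0, cross_pen=10.0,
                        cooldown=30.0, t_min=1.0, t_max=200.0):
    """Apply self-penalty, cross-penalty and cooldown, then clip to [t_min, t_max]."""
    m = len(temps)
    gains = [1.0 - m * r for r in rel_err]  # > 0 when better than average
    total_gain = sum(gains)
    new_temps = []
    for t, g in zip(temps, gains):
        delta = -self_pen * g + cross_pen * (total_gain - g)
        new_temps.append(min(t_max, max(t_min, t + delta - cooldown)))
    return new_temps
```

A member whose output is close to the teacher signal is driven down to the minimum temperature, while a member that answers with the wrong class is pushed towards the maximum; when all members perform equally, only the cooldown acts and every temperature drops by the same amount.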

For binary problems, it is clear that a member that performed well regarding
the correct input-output relation of the supervisor will perform badly whenever
the supervisor gives an erroneous learning pattern, causing its temperature to
rise and blocking the member from learning that training pattern. The
temperature control mechanism allows each member to automatically select
learning data that are relevant to the member's expertise. The characteristic of
the imperfect supervisor shown in Eq. (2) ensures that although the supervisor
is imperfect, in general it generates correct teacher signals, so at the end of
the learning process we can extract a winner from the ensemble by selecting the
member that is active most of the time, because it is the member that benefits
most from the learning process. This is done by selecting the member that has
the lowest average temperature over the learning process.

5 Experiment 

In this experiment, the proposed ensemble model is applied to the Iris
Classification problem. In this problem, the neural network has to classify an
Iris flower into one of three classes (Iris-Setosa, Iris-Versicolor,
Iris-Virginica), based on the length and width of the flower's sepal and petal.

In this experiment, each member of the ensemble has 4 input neurons and
3 output neurons. In the output layer, Iris-Setosa is represented by "100",
Iris-Versicolor by "010" and Iris-Virginica by "001".

An error in the teacher signal is generated by flipping the "1" bit to "0" and
randomly flipping one of the two "0" bits to "1".
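An imperfect supervisor of this kind can be simulated as below. The function name and the explicit random generator argument are assumptions of this sketch; the one-hot coding and the way the erroneous class is chosen follow the description above.

```python
import random


def teacher_signal(true_class, error_rate, rng, n_classes=3):
    """Return a one-hot teacher signal that is wrong with probability error_rate."""
    signal = [0] * n_classes
    if rng.random() < error_rate:
        # Erroneous signal: the "1" moves to one of the other classes at random.
        wrong = rng.choice([c for c in range(n_classes) if c != true_class])
        signal[wrong] = 1
    else:
        signal[true_class] = 1
    return signal
```

Calling `teacher_signal(1, 0.1, random.Random(0))` would return the code of Iris-Versicolor about nine times out of ten and the code of one of the two other classes otherwise.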

We used a three-membered ensemble, with 8, 9 and 10 middle neurons,
respectively. The parameter settings for this experiment are shown in Table 1.



Table 1. Parameter Settings

Parameter       Value
Learning Rate   0.3
Momentum        0.1
Self-Penalty    100
Cross-Penalty   10
Cooldown        30
Tmax            200
Tmin            1



Sixty Iris data samples were provided for training (20 for each class), and the
number of learning iterations was set to 20000, with a training pattern chosen
randomly at each iteration. After the learning process, the winner was tested
using 60 data samples that were not used in the learning process to evaluate its
classification accuracy. The accuracy of the ensemble's winner (represented by
"win") when the supervisor is imperfect is shown in Fig. 2. For comparison, we
also tested each member's accuracy, provided that the member was trained
independently (represented by "sgl8", "sgl9", "sgl10"). The accuracy of a
particular member, A, is defined as follows.






$$A = 1 - \frac{1}{N_p} \sum_{i=1}^{N_p} dif(O(X(i)), d^{true}(X(i))) \qquad (14)$$
$$O(X(i)) = s(y^{out}(X(i)))$$
$$dif(a, b) = \begin{cases} 1 & \text{for } a \ne b \\ 0 & \text{otherwise,} \end{cases}$$

where $y^{out}(X(i))$ and $d^{true}(X(i))$ are the output vector of the neural
network and the correct teacher signal given input X(i), respectively. $N_p$ is
the number of test patterns.
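The accuracy measure can be sketched as follows: each output vector is rounded element by element with the stepping function and counted as an error unless it matches the correct teacher signal exactly. The function names and the `beta` parameter are assumptions of this sketch.

```python
def step(x, beta):
    """Stepping function: 1, 0, or 0.5 inside the ambiguous band."""
    if x > 0.5 + beta:
        return 1.0
    if x < 0.5 - beta:
        return 0.0
    return 0.5


def accuracy(outputs, targets, beta):
    """One minus the fraction of test patterns whose rounded output is wrong."""
    errors = 0
    for y, d in zip(outputs, targets):
        rounded = [step(v, beta) for v in y]
        if rounded != list(d):
            errors += 1
    return 1.0 - errors / len(outputs)
```

Note that an output falling in the "don't know" band counts as an error under this definition, since 0.5 never equals a bit of the teacher signal.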

We also compared the classification accuracy of the ensemble's winner with
another ensemble system which averages the independently trained members'
outputs as follows.

$$O^{avr} = s\left( \frac{1}{M} \sum_{j=1}^{M} y^{j,out} \right) \qquad (15)$$

$O^{avr}$ is the output of the average system (shown by "avr" in Fig. 3),
$y^{j,out}$ is the output of the j-th member, s() is the stepping function
defined in Eq. (10), and M is the number of members. Figs. 2 and 3 show that the
winner classifies better over a wider range of error rates than the conventional
unitary MLPs and the average system. It is reasonable to use the proposed
ensemble model when we do not have any information concerning the supervisor's
reliability.

Fig. 2. Performance Comparison (Winner-Unitary MLPs)




Fig. 3. Performance Comparison (Winner-Average system) 



Figure 4 shows the temperatures of the ensemble's members during the training
process. The left graph shows the members' temperatures when the error rate
is 0%, in which the temperature of the member with 10 middle neurons converged
to $T_{min}$ after a number of learning iterations while the temperatures of the
rest of the members converged to $T_{max}$. This implies that the member with
10 middle neurons dominated the other members in the learning process and
should be selected as the winner at the end of the learning process. The right
graph shows the members' temperatures when the error rate is 10%: the
temperature of the winner fluctuates around the minimum temperature while the
others' fluctuate around a much higher temperature. This fluctuation is caused
by the erroneous data generated by the supervisor, which force the winner to
occasionally pass its domination to the other members by increasing its
temperature and decreasing the others'.




[Plot: temperature of each member (mid:8, mid:9, mid:10 (win)) against iteration (x 100), from 0 to 200]



Fig. 4. Temperature during training 



6 Conclusion and Future Works 

We have proposed a neural network ensemble model that allows its members 
to select learning data automatically. The characteristics of the temperature 
control mechanism allow us to obtain a member that learns only from the 
correct learning data generated by an imperfect supervisor, assuming that the 
supervisor generally behaves correctly. The ability to select the correct 
learning data automatically is useful because it eases the strict requirement 
of providing a neural network with perfect training data, which gives more 
freedom in designing learning data. Tolerating erroneous learning data to 
some extent will support the broader use of neural networks in real-world 
problems in which the help of human experts for data selection is not 
cost-efficient, or in complicated problems where even human experts commit 
errors. In the future we are considering applying the proposed ensemble to 
on-line training systems, fault-tolerant systems, medical diagnosis support 
systems, etc. In this paper our focus was on binary problems; in the future we 
plan to refine the ensemble model so that it can deal with regression problems. 

Abstract. It has been shown by several researchers that multi-classifier 
systems can result in effective solutions to difficult tasks. In this 
work, we propose a multi-classifier system based on both supervised 
and unsupervised learning. According to the principle of “divide-and- 
conquer”, the input space is partitioned into overlapping subspaces and 
Support Vector Machines (SVMs) are subsequently used to solve the 
respective classification subtasks. Finally, the decisions of the individual 
SVMs are appropriately combined to obtain the final classification 
decision. We used the Fuzzy c-means (FCM) method for input space 
partitioning and we considered a scheme for combining the decisions of 
the SVMs based on a probabilistic interpretation. Compared to single 
SVMs, the multi-SVM classification system exhibits promising accuracy 
performance on well-known data sets. 



1 Introduction 



Multi-classifier systems have recently been used with great success in difficult 
pattern recognition problems. A major issue in the design of multiple classifier 
systems concerns whether individual learners are correlated or independent. The 
first alternative is usually applied to multistage approaches (such as boosting 
techniques [1,2,3,4]), whereby specialized classifiers are serially constructed to 
deal with data points misclassified in previous stages. In particular, one such 
approach concerns the application of boosting to speed up the 
training of Support Vector Machines (SVMs). The second alternative advocates 
the idea of using a committee of classifiers which are trained independently (in 
parallel) on the available training patterns, and combining their decisions to 
produce the final decision of the system. This combination can be based on 
two general strategies, namely selection or fusion. In the case of selection, one or 
more classifiers are nominated “local experts” in some region of the feature space 
(which is appropriately divided into regions), based on their classification 
“expertise” in that region [3], whereas fusion assumes that all classifiers have equal 
expertise over the whole feature space. A variety of techniques have been applied 
to implement classifier fusion by combining the outputs of multiple classifiers. 

In the case of multiple independent classifiers, several schemes can be adopted 
regarding the generation of appropriate training sets. The whole set can be used 
by all classifiers, or multiple versions can be formed as bootstrap replicates. 
Another approach is to partition the training set into smaller disjoint subsets 
but with proportional distribution of examples of each class. 

The present work follows a different approach based on partitioning of the 
original training data set into subsets and the subsequent use of individual clas- 
sifiers for solving the respective learning subtasks. A key feature of the method 
is that the training subsets represent non-disjoint regions that result from input- 
space clustering. Thus, SVMs are assigned to overlapping regions from the be- 
ginning and acquire their specialization through training with data sets that 
are representative of the regions. This partitioning approach produces a set of 
correlated “specialized” classifiers which attack a complex problem by applying 
the divide-and-conquer principle. 

In the next section, we address the issue of data partitioning based on 
unsupervised learning with the fuzzy c-means method. Section 3 describes the 
multi-classifier system and the scheme for combining SVM decisions. Experimental 
results for the evaluation of the proposed method are presented in Section 4 and 
conclusions are presented in Section 5. 



2 Partitioning of the Data Set 

Consider a data set D having N patterns $x^i$, where $x^i \in R^d$, i = 1, ..., N. The 
first stage of the proposed classification technique consists of partitioning the 
original data set $D = \{x^1, \ldots, x^N\}$ using clustering techniques to identify natural 
groupings. As a result of clustering, a number of training subsets $D_1, D_2, \ldots, D_M$ 
are generated from the set D. The clustering technique tested in this work is 
briefly described in the following subsection. 



2.1 Fuzzy C-Means Clustering 

Fuzzy c-means (FCM) is a data clustering technique in which a data sample 
belongs to all clusters with a membership degree. FCM partitions the data set 
into M fuzzy clusters (where M is specified in advance), and provides the center 
of each cluster. Clustering is usually based on the Euclidean distance: 

$$d(x, \mu) = \left( \sum_{i=1}^{d} (x_i - \mu_i)^2 \right)^{1/2} \qquad (1)$$

where $x \in R^d$ is a training sample and $\mu \in R^d$ corresponds to a cluster center. 
The FCM algorithm provides fuzzy partitioning, so that a given data point x 
belongs to cluster j (with center $\mu_j$) with membership degree $u_j$ varying between 
0 and 1: 

$$u_j = \frac{1}{\sum_{k=1}^{M} \left( \frac{d(x, \mu_j)}{d(x, \mu_k)} \right)^{2/(m-1)}} \qquad (2)$$

The membership degrees are normalized in the sense that, for every pattern, 

$$\sum_{j=1}^{M} u_j = 1 \qquad (3)$$

Starting from arbitrary initial positions for the cluster centers, and by iteratively 
updating the cluster centers and the membership degrees for each 
training point $x^i$, i = 1, ..., N, the algorithm moves the cluster centers to sensible 
locations within the data set. This iteration is based on minimizing an objective 
function J that represents the distance from any given data point to a cluster 
center weighted by the data point’s membership degree: 

$$J(\mu_1, \ldots, \mu_M) = \sum_{i=1}^{N} \sum_{j=1}^{M} u_{ij}^{m} \, d^2(x^i, \mu_j) \qquad (4)$$

where $u_{ij}$ is the membership degree of $x^i$ to cluster j and $m \in [1, \infty)$ is a 
weighting exponent. 

The main drawbacks of this algorithm are that its performance depends on the 
initial cluster centers and that the number of clusters must be specified in 
advance by the user. Therefore, the FCM algorithm has to be run several times, 
each time with a different number of clusters, to discover the number of 
clusters that results in the best performance of the classification system. 
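
As an illustration of the iteration described in this subsection, the following is a minimal pure-Python sketch of fuzzy c-means. It assumes the standard textbook formulation (Euclidean distance, membership exponent 2/(m-1), centers updated as means weighted by the m-th power of the memberships), which may differ in detail from the implementation used in this work. The function names, the default m = 2, the stopping tolerance and the optional `init_centers` argument are illustrative choices, not part of the original method description.

```python
import math
import random


def euclidean(x, mu):
    """Euclidean distance between a sample x and a center mu."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, mu)))


def memberships(x, centers, m=2.0, eps=1e-12):
    """Membership degrees of x to every cluster; they sum to 1 (requires m > 1)."""
    dists = [euclidean(x, c) for c in centers]
    # A point lying exactly on a center belongs fully to that cluster.
    for j, dj in enumerate(dists):
        if dj < eps:
            return [1.0 if k == j else 0.0 for k in range(len(centers))]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** power for dk in dists) for dj in dists]


def fuzzy_c_means(data, n_clusters, m=2.0, n_iter=100, tol=1e-6,
                  init_centers=None, seed=0):
    """Return (centers, u), where u[i][j] is the membership of data[i] to cluster j."""
    rng = random.Random(seed)
    if init_centers is None:
        # Arbitrary initial positions: pick distinct samples at random.
        centers = [list(p) for p in rng.sample(data, n_clusters)]
    else:
        centers = [list(c) for c in init_centers]
    dim = len(data[0])
    for _ in range(n_iter):
        u = [memberships(x, centers, m) for x in data]
        new_centers = []
        for j in range(n_clusters):
            weights = [row[j] ** m for row in u]
            total = sum(weights)
            new_centers.append([
                sum(w * x[d] for w, x in zip(weights, data)) / total
                for d in range(dim)
            ])
        shift = max(euclidean(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # centers have stopped moving
            break
    u = [memberships(x, centers, m) for x in data]
    return centers, u
```

Because the result depends on the initial centers and on the chosen number of clusters, such a routine would typically be called several times with different settings, as discussed above.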

2.2 Training Sets Generation 

Following fuzzy clustering we can specify the degree (varying between 0 and 1) 
with which a data point belongs to each cluster. Let x be an input data point 
with membership degree $u_j$ to cluster j. To create M non-disjoint 
training sets corresponding to the M clusters we perform the following 
steps for each data point x: 

1. If $u_J \geq u_j$ for all j = 1, ..., M, then the data point x is assigned to the 
training set $D_J$. 

2. For every j = 1, ..., M, j ≠ J, a random number q is generated according 
to a uniform distribution in the interval (0,1) and the data point x is assigned 
to the training set $D_j$ if q < $u_j$. 

Therefore, the data point x is assigned deterministically to the training set 
corresponding to the cluster with maximum membership for that point and is assigned 
probabilistically to each of the remaining training sets (with probability equal 
to the degree of membership to the respective cluster). 
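
The two assignment steps above can be sketched as follows. This is a minimal illustration, assuming the membership degrees are already available as one list per data point; the function name `build_training_sets` and the `seed` argument are hypothetical, and ties for the maximum membership are broken in favour of the lowest cluster index.

```python
import random


def build_training_sets(points, memberships, seed=None):
    """Build M non-disjoint training sets from fuzzy membership degrees.

    points[i] is a data point and memberships[i][j] its membership
    degree to cluster j.
    """
    rng = random.Random(seed)
    n_clusters = len(memberships[0])
    subsets = [[] for _ in range(n_clusters)]
    for x, u in zip(points, memberships):
        # Step 1: deterministic assignment to the cluster of maximum membership.
        best = max(range(n_clusters), key=lambda j: u[j])
        subsets[best].append(x)
        # Step 2: probabilistic assignment to every other cluster,
        # with probability equal to the membership degree.
        for j in range(n_clusters):
            if j != best and rng.random() < u[j]:
                subsets[j].append(x)
    return subsets
```

Each point therefore appears in at least one subset, and points lying between clusters are likely to appear in several, which produces the overlap between training sets.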

Figure 1 displays the non-disjoint training sets that result from applying the 
fuzzy c-means clustering method to the well-known Clouds data set with three 
clusters. The three cluster centers are represented as big circles and the patterns 
of each training set are represented as crosses, circles and stars respectively. 
We can also observe a degree of overlap between the training sets, as some 
patterns belong to two or three training sets simultaneously. The correlation 
between the data sets has a beneficial impact, increasing the robustness of the 
multi-SVM classification system. 

Fig. 1. Training sets corresponding to three-cluster partition of the Clouds data set 
using the fuzzy c-means algorithm. 



3 The Multi-classifier System 

Support Vector Machines (SVMs) are a learning paradigm based on the work of 
V. Vapnik and his team (AT&T Bell Labs). The support vector algorithm 
applies the Structural Risk Minimization principle to construct rules that exhibit 
good generalization abilities. In doing so, it extracts a small subset of the 
training data called the “support vectors”. 

As for the classification modules of the proposed multi-classifier 
system, the primary idea is to train a Support Vector Machine for each group 
of patterns $D_j$ generated through the partitioning of the original data set D. In 
this sense, each classifier learns a subspace of the problem domain and becomes 
a “local expert” for the corresponding subdomain. 



3.1 Training of Individual SVMs 

The idea of a support vector machine is based on the following two operations: 

1. Nonlinear mapping of an input vector into a high-dimensional feature space 
that is hidden from both the input and the output. 

2. Construction of an optimal hyperplane for separating the above features. 

For details on the SVM approach we refer to the literature. In the following we 
briefly discuss the options adopted in our implementation. 

Each individual classifier has been implemented as a Support Vector Machine 
using radial basis functions as the nonlinear transformations from the input space 
to the feature space: 

$$f(x) = \mathrm{sign}\left( \sum_{i=1}^{N} \alpha_i \, y_i \exp\left( -\frac{\|x - x^i\|^2}{2\sigma^2} \right) + b \right) \qquad (5)$$

Here σ is a width parameter defined a priori, $y_i \in \{-1, +1\}$ is the class label of 
the training pattern $x^i$, b is a bias term, and $\alpha_i$, i = 1, ..., N are parameters 
(Lagrange multipliers) determined optimally by the SVM algorithm, which 
permits us to construct a decision surface that is nonlinear in the input space, 
but whose image is linear in the feature space. For our experiments we used a simple, 
easy-to-use package for SVM classification: OSU SVM Toolbox 2.00. 
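
To make the role of the width parameter and of the Lagrange multipliers concrete, the following is a minimal sketch of how an already trained RBF-kernel SVM could evaluate its decision function. It assumes the common form of the decision rule, with labels in {-1, +1}, a bias term and the kernel exp(-||x - z||^2 / (2 sigma^2)); the function names and argument layout are illustrative and do not reproduce the interface of the OSU SVM Toolbox.

```python
import math


def rbf_kernel(x, z, sigma):
    """Gaussian radial basis function of width sigma (assumed parametrization)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def svm_decide(x, support_vectors, labels, alphas, bias, sigma):
    """Return +1 or -1 from the kernel expansion over the support vectors."""
    score = sum(a * y * rbf_kernel(x, sv, sigma)
                for sv, y, a in zip(support_vectors, labels, alphas))
    return 1 if score + bias >= 0 else -1
```

Only the patterns with non-zero multipliers (the support vectors) contribute to the sum, which is why the trained classifier depends on a small subset of the training data.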

As already mentioned, an important advantage of the multi-classifier method 
is that the training of each SVM can be done separately and in parallel. Thus, 
in the case of a parallel implementation, the total training time of the system 
equals the longest training time among the SVMs. It must be noted that 
this total training time cannot be greater than the training time of a single SVM 
classifier dealing with the entire training set. 



3.2 Combination of Decisions 

As described above, the original training data set D is partitioned into M (non- 
disjoint) subsets, and M classifiers are trained, one for each subset. Consider a 
new input vector x which belongs to one of c classes. Given the vector x, a class 
label Cj ,j = 1, . . . , M, is produced by each SVMj, and the membership degree 
Uj of the vector x to the respective cluster j is computed. To obtain the final 
classification decision, the decisions of the individual SVMs are combined in a 
probabilistic way. 

A usual approach to obtain the classification of x is to compute the 
probability P(k | x) (k = 1, ..., c) that pattern x belongs to class k and select the class 
C with the maximum P(C | x) as the final decision following the Bayes rule. 

We have adopted this approach, where the probability P(k | x) is 
computed as follows: 

$$P(k \mid x) = \sum_{j=1}^{M} u_j \, I(C_j = k) \qquad (6)$$

Here I(z) is an indicator function, i.e. I(z) = 1 if z is true, otherwise I(z) = 0. 
The above equation states that the class probability P(k | x) results as the 
sum of the weights $u_j$ of the classifiers that suggest class k. It is easy to check 
that $\sum_{k=1}^{c} P(k \mid x) = 1$. It must be noted that the combination method (6) is 
general since it considers the class label suggested by each classifier and not the 
numerical output vectors. Consequently, the method can also be used with other 
types of classifiers, e.g. decision trees or neural networks. 
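
The combination rule described in this subsection can be sketched in a few lines. The sketch assumes that class labels are coded as integers 0, ..., c-1 and that the membership degrees of the new pattern sum to one; the function name `combine_decisions` is hypothetical.

```python
def combine_decisions(labels, memberships, n_classes):
    """Combine the class labels suggested by M classifiers.

    labels[j] is the class suggested by classifier j (an integer in
    0..n_classes-1) and memberships[j] is the membership degree of the
    input pattern to cluster j.
    """
    probs = [0.0] * n_classes
    for c_j, u_j in zip(labels, memberships):
        probs[c_j] += u_j  # each classifier votes with weight u_j
    winner = max(range(n_classes), key=lambda k: probs[k])
    return winner, probs
```

For example, with labels [0, 1, 1] and membership degrees [0.6, 0.3, 0.1], class 0 is selected although two of the three classifiers suggest class 1, because the classifier responsible for the region closest to the pattern carries the largest weight.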
4 Experimental Results 

In this section, we present performance results from the use of the proposed 
classification system using the FCM algorithm for data partitioning and the 
previously described scheme for combining classifier decisions. Four well-known 
data sets were used in our experiments, as shown in Table 1. 



Table 1. Summary of the data sets. 

Dataset        Cases   Classes   Features
                                 Continuous   Discrete
Clouds         5000    2         2            -
Diabetes       768     2         9            -
Segmentation   2310    7         19           -
Phoneme        5404    2         5            -

For each data set, ten experiments were performed with random splits of the 
data into training and test sets of fixed size. The min, mean and max errors were 
calculated from these ten trials. For each experiment, each individual SVM was 
trained several times with different values for σ and the regularization parameter 
C of the SVM algorithm. The best outcome of the trials according to the training 
error and the number of support vectors was used when testing the combination 
scheme. The obtained results show that the proposed multi-SVM classification 
system outperforms several methods reported in the literature for the Clouds 
data set [7,21], the Diabetes data set, the Segmentation data set 
and the Phoneme data set. It should be noted, however, that since 
the partitioning of the data may or may not be the same as in our case, this 
comparison should be considered only indicative. 



4.1 The Clouds Data 

The Clouds artificial data from the ELENA project are two-dimensional 
with two a priori equally probable classes. There are 5000 examples in the data 
set, 2500 in each class (50%). The theoretical error is 9.66%. 



Table 2. The Clouds data set: Test set error (%) comparative results. 

Clouds data set
Classifier   min     mean    max
Multi-SVM    9.4     9.99    10.9
SVM          9.9     11.02   11.9

In our experiments, we used 4000 patterns for training and 1000 patterns for 
testing the system. For the FCM algorithm, we obtained the best 
results by splitting the original training data set into three subsets. As a result 
of clustering, three SVM classifiers were combined in our system. The testing results 
for the multi-SVM classification system are shown in Table 2. For comparison, 
results from using a single SVM classifier are also shown in Table 2. 

The classification error obtained with the multi-SVM system is quite close to 
the theoretical one; therefore, any further improvement can hardly be achieved. 



4.2 The Diabetes Data 

The Diabetes set from the UCI data set repository contains 768 
8-dimensional patterns belonging to two classes. In our experiments, we used 600 
patterns for training and 168 patterns for testing the system. For the FCM 
algorithm, we obtained the best results by splitting the original training data set into 
three subsets. As a result of clustering, three SVM classifiers were combined in our 
system. The testing results for the multi-SVM classification system are shown 
in Table 3. For comparison, results from using a single SVM classifier are also 
shown in Table 3. 



Table 3. The Diabetes data set: Test set error (%) comparative results. 

Diabetes data set
Classifier   min     mean    max
Multi-SVM    16.67   21.13   25.6
SVM          17.26   22.86   26.79



It must also be noted that this data set contains some known outliers that 
affect the construction of the clusters and eventually the classification performance 
of the system. 



4.3 The Image Segmentation Data 

The Image Segmentation data set from the UCI data set repository contains 
2310 19-dimensional examples belonging to 7 classes. We used 1500 patterns for 
training and 810 patterns for testing the system. For the FCM algorithm, we 
obtained the best results by splitting the original training data set into four 
subsets. As a result of clustering, four SVM classifiers were combined in our system. 
The testing results for the multi-SVM classification system are shown in Table 4. 
For comparison, results from using a single SVM classifier are also shown in 
Table 4. 

In our experiments, we preprocessed the Image Segmentation data set by 
applying principal component analysis (PCA). In addition, the size of the input 
vectors was reduced to a 7-dimensional space by retaining only those 
components which contribute more than a specified fraction (set to 0.009) of the 
total variation in the data set. 

Table 4. The Segmentation data set: Test set error (%) comparative results. 

Segmentation data set
Classifier   min     mean    max
Multi-SVM    6.3     7.1     8.02
SVM          6.67    7.38    8.15


4.4 The Phoneme Data 

The Phoneme data set from the ELENA project contains 5404 5-dimensional 
patterns belonging to two classes. In our experiments, we used 4500 patterns for 
training and 904 patterns for testing the system. For the FCM algorithm, we 
obtained the best results by splitting the original training data set into two 
subsets. As a result of clustering, two SVM classifiers were combined in our system. 
The testing results for the multi-SVM classification system are shown in Table 5. 
For comparison, results from using a single SVM classifier are also shown in 
Table 5. 

Table 5. The Phoneme data set: Test set error (%) comparative results. 

Phoneme data set
Classifier   min     mean    max
Multi-SVM    8.85    9.73    10.4
SVM          8.63    10.25   10.95



5 Conclusions 

In this work, we present and test a multi-SVM classification system that is based 
on both unsupervised and supervised learning methods. To build the 
classification system, first the original training set is divided into overlapping subsets by 
applying a clustering technique. Then, an individual SVM is trained on every 
resulting subset. To obtain the classification of a new pattern, the decisions of 
the SVMs are appropriately combined. An important strength of the proposed 
classification approach is that it does not depend on the type of the classifier; 
therefore, it is quite general and applicable to a wide class of models including 
neural networks and other classification techniques. The learning method offers 
the advantages of the “divide-and-conquer” framework, i.e., smaller classification 
models may be employed that can be trained in parallel on smaller (and usually 
easier to discriminate) training sets. 

We have applied the fuzzy c-means algorithm for data clustering and 
considered a combination of the decisions of multiple SVMs based on a probabilistic 
interpretation. The resulting approach has been tested on different benchmark 
data sets, exhibiting very promising performance. The main conclusion that can 
be drawn from the experimental results is that, as expected, the multi-SVM 
system performs better than a single SVM classifier, reducing the mean test 
error by between 0.3 and 1.7 percentage points on four data sets of 
different size and number of classes. An important 
observation that occasionally came up during data partitioning in our study was the 
creation of training sets with examples of a single class. In such cases there was no need 
to train a classifier for these data sets. 

The multi-classifier methodology implemented in this work is quite general, 
allowing the implementation and testing of other techniques both in the 
clustering and the classification module. 

Abstract. Mammography is a non-invasive diagnostic technique widely used for 
early detection of breast cancer. One of the main indicators of cancer is the 
presence of microcalcifications, i.e. small calcium accumulations, often 
grouped into clusters. Automatic detection and recognition of malignant 
clusters of microcalcifications are very difficult because of the small size of the 
microcalcifications and the poor quality of the mammographic images. Up to 
now, mainly two kinds of approaches have been proposed to tackle this 
problem: those performing the classification by looking at the features of single 
microcalcifications and those based on the classification of clusters, which in 
turn use features characterizing the spatial distribution of the microcalcifications 
in the breast. In this paper we propose a novel approach for recognizing 
malignant clusters, based on a Multiple Classifier System (MCS) which 
simultaneously uses the evidence obtainable from the classification of the single 
microcalcifications and from the classification of the cluster considered as a 
whole. The approach has been tested on a standard database of 40 
mammographic images and proved very effective with respect to the single 
experts. 



1. Introduction 

Mammography is a radiological screening technique which makes it possible to detect 
lesions in the breast using low doses of radiation. At present, it represents the only 
non-invasive diagnostic technique which allows the diagnosis of a breast cancer at a very 
early stage, when it is still possible to successfully attack the disease with a suitable 
therapy. 

A particularly meaningful visual clue of breast cancer is the presence of clusters of 
microcalcifications. Microcalcifications are tiny granular deposits of calcium that 
appear on the mammogram as small bright spots. Their size ranges from about 
0.1 mm to 0.7 mm, while their shape is sometimes irregular. Besides being arranged 
into clusters, microcalcifications can appear isolated and spread over the breast tissue, 
but in this case they are not an indication of a possible cancer. However, even in the case 
of clustered microcalcifications their nature is not necessarily malignant, and thus the 
radiologist must carefully analyze the mammogram to decide if the appearance of the 
cluster suggests a malignant case. Such a decision is taken on the basis of some 
properties (shape, size, distribution, etc.) related to both the single microcalcifications 
and the whole cluster. 

A computer-aided analysis could be very useful to the radiologist both for 
prompting suspect cases and for helping in the diagnostic decision as a “second 
reading”, especially in the case of mass screening. Such a tool should improve 
both the sensitivity of the diagnosis, i.e. the accuracy in recognizing all the malignant 
cases, and its specificity, i.e. the ability to avoid erroneous recognition of benign 
clusters as malignant. 

In the recent past, many approaches have been proposed for the automatic 
detection and recognition of clusters of microcalcifications. Up to now, 
most of the research efforts in this field have been devoted to the detection of 
microcalcifications, which is an inherently complex problem because of the low 
resolution and very low contrast of mammograms. The main approach currently used 
is based on the wavelet transform; other proposed methods rely on Gaussian 
filtering, artificial neural networks, texture analysis, mathematical morphology and 
fuzzy logic. 

On the other hand, methods explicitly devoted to the classification phase are very 
few and mainly focused on the analysis of the single microcalcifications. 
Additional information attainable by examining the cluster is usually not considered, 
although in several cases it has proven to be essential for a correct diagnosis. 

In this paper we propose a novel approach for the automatic classification of 
clusters of microcalcifications, based on the adoption of a Multiple Classifier System 
(MCS). The proposed system employs two classifiers, one for classifying single 
microcalcifications and the other for classifying the cluster as a whole. The responses 
of both classifiers, together with their estimated reliability, are used by a 
combination stage to take the final decision. The assumption underlying this approach 
is that the collective decision taken on the basis of the responses of an ensemble of 
classifiers is less likely to be in error than the decision made by any of the individual 
classifiers, provided that the latter supply complementary discriminative information. 
The proposed method has been tested on a standard database of 
mammograms. Since the focus of this work is on the combining scheme, for the 
classifiers we have adopted two sets of relatively simple features found in the 
literature. The results obtained confirmed the effectiveness of the MCS, whose 
performance was better than that of each of its component experts. 



2. The Proposed Method 

As previously said, the automatic recognition of malignant clusters of 
microcalcifications is a quite difficult classification task because of the low quality of 
the input images. Even the most sophisticated segmentation methods can extract 
the microcalcifications with seriously distorted shapes. These distortions make the 
subsequent feature selection phase very critical. Various feature sets have been 
proposed up to now, based, among others, on shape analysis [10] or texture analysis. 
Experiments on real images highlighted advantages and disadvantages of these 
feature sets when used individually, but none of them proved definitely optimal. 

A different approach for the classification stage is to consider the whole cluster 
instead of the single microcalcifications, since the distribution of the 
microcalcifications within the breast tissue is recognized as another meaningful clue 
for the final diagnosis. In this case, the features to be employed describe the shape 
of the cluster and other parameters characterizing the distribution of 
microcalcifications within it. These features give good results in situations where 
microcalcifications are heavily distorted in their shape, as long as the cluster they 
belong to is clearly identifiable. Unfortunately, this approach becomes unreliable 
when the cluster is weakly described, i.e. when the total number of 
microcalcifications forming the cluster is low. 

These considerations suggest employing both approaches in the classification 
stage, using an MCS which can effectively exploit the 
complementary evidence coming from the two diverse classifiers. In the next subsection 
we introduce the architecture of such a system. 



2.1 System Architecture 

The first processing task in a system for the automatic recognition of malignant 
microcalcifications consists in detecting the microcalcifications and grouping them into 
clusters. This task can be carried out by adopting one of the many methods proposed 
up to now in the specialized literature. As the present paper is mainly concerned with 
the classification issues, we consider as our starting point an image in which the 
microcalcifications have already been detected and grouped into clusters (see Fig. 1). 
At this point, the problems of feature extraction and classification can be faced. For 
the sake of clarity, we will hereafter use the term expert to denote a system composed 
of a classifier and the corresponding feature extractor. According to this terminology, 
we can say that the whole MCS is composed of two different experts: the first one 
(μC-Expert) is devised for the classification of the single microcalcifications, while 
the second one (Cluster Expert) looks at the entire cluster. 

To classify a cluster containing microcalcifications, each microcalcification is 
classified by the μC-Expert, while the cluster, considered as a whole, is classified by 
the Cluster Expert. The final classification decision is obtained by collecting their 
responses and applying a suitable decision scheme based on the evaluation of the 
reliability of each classification. 

Both the considered experts employ the same classification model, i.e. a 
Multi-Layer Perceptron (MLP) with a sigmoidal activation function. It consists of a 
three-layer fully connected network, containing 25 neurons in the hidden layer, 2 
output neurons (associated with the benign and the malignant class) and a number of 
nodes in the input layer depending on the size of the feature vector employed. Both 
classifiers have been trained with the standard Back Propagation algorithm, 
adopting a constant learning rate η equal to 0.5. 

In the next subsections we will describe in more detail the features used by the two 
types of expert and the rules that determine the combiner’s final decision. 

[Figure 1: block diagram of the system; input label: Pre-Processed Microcalcification.] 

Fig. 1. The architecture of the proposed Multi-Classifier System (MCS). 



2.2 The μC-Expert Features 

Since the main goal of this paper is to evaluate the benefits introduced by the use of an 
MCS, we have avoided the definition of new features, preferring the adoption of 
features already known in the literature. 

The μC-Expert uses four features based on the shape of the microcalcification and 
eight based on its texture properties. This choice is based on the experimental evidence that 
the more irregular the shape of the calcification, the higher the risk of breast cancer. 
In particular, benign cases are characterized by round microcalcifications with smooth 
borders and uniform shapes, while clusters having microcalcifications with irregular 
borders and with various sizes and shapes are typically malignant. On this basis, we 
have adopted the following four features, based on the analysis of the shape: 

S1. compactness: evaluated as the ratio between the area of the microcalcification 
and the square of the perimeter of its border; 

S2. roughness: a measure of the irregularity of the border, defined as 
the standard deviation of the square distance of the points of the border from the 
center of the microcalcification; 

S3. border gradient strength: an estimate of the intensity gradient measured on 
the points of the border of the microcalcification; 

S4. local contrast: a measure of the average difference between the intensity of 
the points belonging to the microcalcification and the intensity of the points 
belonging to the background. 
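
As a small worked illustration of the first two shape features, the sketch below computes compactness and roughness from quantities assumed to be already extracted from the segmented microcalcification (area, perimeter, border point coordinates and center). The function names are hypothetical, and the use of the population standard deviation is an assumption, since the normalization is not specified here.

```python
import math


def compactness(area, perimeter):
    """S1: area divided by the square of the border perimeter."""
    return area / perimeter ** 2


def roughness(border_points, center):
    """S2: standard deviation of the squared distances border -> center."""
    cx, cy = center
    sq_dists = [(x - cx) ** 2 + (y - cy) ** 2 for x, y in border_points]
    mean = sum(sq_dists) / len(sq_dists)
    variance = sum((s - mean) ** 2 for s in sq_dists) / len(sq_dists)
    return math.sqrt(variance)
```

With this definition a perfect disc of any radius has compactness 1/(4π), the largest possible value, and zero roughness, while elongated or jagged shapes give a lower compactness and a higher roughness.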

These features are evaluated on an Area of Interest (AOI), a square box of
17x17 pixels centered on the centroid of the microcalcification.

The reason for using an additional set of features based on the texture properties is
twofold. First, in this way we have a characterization of the tissue surrounding the
microcalcification and (indirectly) of the underlying biological process which has
produced the microcalcification. Second, when the shape-based features are not
completely reliable (because of the low quality of the image), the texture-based
features can provide further information for deciding on the malignancy of the
microcalcification. The texture features used are taken from the literature:

T1. energy in the AOI: the average squared intensity in the AOI;

T2. energy in the background: the average squared intensity in the background;

T3. average intensity in the AOI;

T4. standard deviation of the intensity in the AOI;

T5. entropy of the 1st order histogram: a measure of the uniformity of the
distribution of the gray levels in the AOI. The higher this parameter, the more
uniform the distribution of gray levels in the AOI;

T6. energy of the 2nd order histogram: the average squared value of the co-occurrence
matrix evaluated on the AOI;

T7. contrast of the 2nd order histogram: a measure of the distribution of the
differences among the gray levels exhibited by the points in the AOI;

T8. entropy of the 2nd order histogram: a measure of the uniformity of the
distribution of the values in the co-occurrence matrix.
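
To make the histogram-based features more concrete, here is a small Python sketch of the first-order entropy and of three second-order quantities. It rests on several assumptions not stated in the text: the co-occurrence matrix is built from horizontally adjacent pixels only, it is normalised to sum to one, the energy is taken as the sum of squared entries (which differs from the average only by a constant factor for a fixed matrix size), and the contrast uses the squared gray-level difference. All function names are illustrative.

```python
import math
from collections import Counter


def histogram_entropy(aoi):
    """First-order entropy (in bits) of the gray-level histogram of the AOI."""
    pixels = [g for row in aoi for g in row]
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(pixels).values())


def cooccurrence(aoi):
    """Normalised co-occurrence of horizontally adjacent gray-level pairs."""
    pairs = Counter((row[k], row[k + 1])
                    for row in aoi for k in range(len(row) - 1))
    total = sum(pairs.values())
    return {pair: count / total for pair, count in pairs.items()}


def second_order_features(aoi):
    """Return (energy, contrast, entropy) of the co-occurrence matrix."""
    p = cooccurrence(aoi)
    energy = sum(v * v for v in p.values())
    contrast = sum((i - j) ** 2 * v for (i, j), v in p.items())
    entropy = -sum(v * math.log2(v) for v in p.values())
    return energy, contrast, entropy


flat = [[5, 5, 5], [5, 5, 5]]        # a single gray level
checker = [[0, 1, 0], [1, 0, 1]]     # two alternating gray levels

print(histogram_entropy(flat), second_order_features(flat))
print(histogram_entropy(checker), second_order_features(checker))
```

The uniform patch gives zero entropy and zero contrast, while the alternating patch spreads its mass over two gray levels and two co-occurrence pairs, so both entropies rise and the energy drops.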



2.3 The Cluster Expert Features 

The choice of effective cluster features presents the same problems we have 
highlighted in the previous sections. However, experiments made by several research 
groups have shown that three meaningful clues to distinguish benign from malignant 
clusters could be the shape of the cluster, the distribution of the microcalcifications 
within the clusters and the uniformity of their shapes. 

As regards the shape, malignant clusters are typically quite elongated, while benign 
clusters are more round. The only notable exception is given by the clusters created 
by intraductal microcalcifications which show an extreme ellipticity even though they 
are benign. An adequate measure of the ellipticity can be obtained by evaluating the 
ratio between the major and the minor axis of the ellipse of inertia of the cluster. 

The second type of feature we employ concerns the distribution of the 
microcalcifications within the cluster. We consider both the mass distribution and the 
spatial distribution of the microcalcifications, since malignant clusters show a much 
higher density of microcalcifications with respect to the benign clusters. 

The third group of features is based on the presence of microcalcifications with
irregular and non-uniform shapes. Information about the distribution of shape features
S1 and S2 over the microcalcifications belonging to the cluster can be very helpful. In
fact, the more irregular and the less uniform the shapes of the microcalcifications, the
more likely the cluster is malignant. On the basis of these considerations, we have
adopted the following features for the Cluster Expert:

C1. ratio between the major and the minor axis of the ellipse of inertia of the cluster;

C2. mass density of the whole cluster;

C3. average mass of the microcalcifications;

C4. standard deviation of the masses of the microcalcifications;

C5. average distance between the centroids of the microcalcifications and the centroid of
the cluster;

C6. standard deviation of the distances between the centroids of the microcalcifications and
the centroid of the cluster;

C7. average of the compactness;

C8. standard deviation of the compactness;

C9. average of the roughness;

C10. standard deviation of the roughness.
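
The geometric cluster features can be sketched as follows. This is a minimal illustration that assumes the cluster is described only by the centroids of its microcalcifications (all with equal weight) and that the axes of the ellipse of inertia are proportional to the square roots of the eigenvalues of the second-moment matrix of those centroids; the function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def axis_ratio(points):
    """C1-style ratio between major and minor axis of the ellipse of inertia.

    Raises ZeroDivisionError when all points are collinear (minor axis of length 0).
    """
    cx, cy = mean(p[0] for p in points), mean(p[1] for p in points)
    sxx = mean((x - cx) ** 2 for x, _ in points)
    syy = mean((y - cy) ** 2 for _, y in points)
    sxy = mean((x - cx) * (y - cy) for x, y in points)
    # Eigenvalues of the symmetric 2x2 matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2
    root = math.hypot((sxx - syy) / 2, sxy)
    return math.sqrt((half_trace + root) / (half_trace - root))


def centroid_distances(points):
    """C5, C6: mean and standard deviation of the distances to the cluster centroid."""
    cx, cy = mean(p[0] for p in points), mean(p[1] for p in points)
    d = [math.hypot(x - cx, y - cy) for x, y in points]
    return mean(d), pstdev(d)


cluster = [(2.0, 0.0), (-2.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
print(axis_ratio(cluster))
print(centroid_distances(cluster))
```

For the four points above, which are twice as spread out horizontally as vertically, the axis ratio is 2 and the distances to the centroid have mean 1.5 and standard deviation 0.5.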



3. The Combination Scheme 



In the literature, different combination criteria have been proposed.
The most popular ones are those based on statistical methods, evidence theory, or
heuristic approaches. In all such proposals, the number of confidence degrees coming
from the classifiers involved in the MCS is fixed.

In our case, though the number of classifiers is fixed (one Cluster Expert and one
μC Expert), the number of confidence degrees to be combined is not, because
the μC Expert is activated once for each microcalcification contained in the cluster,
and this number ranges from a few instances to more than 100. This leads to two
problems. First, the single confidence degree coming from the Cluster Expert has to be
combined with a variable number of confidence degrees coming from the μC Expert.
Second, the relative significance of each expert changes with the number of
microcalcifications: while the reliability of the μC Expert is not affected by the
number of microcalcifications, most of the cluster features become unreliable when
the microcalcifications are very few.

As regards the first point, if $N_{\mu C}$ is the total number of microcalcifications in the
cluster we are considering, we have $N_{\mu C}$ outputs coming from the μC Expert applied to
each of the microcalcifications in the cluster: let us call $(O_i^M, O_i^B)$, with
$i = 1, \ldots, N_{\mu C}$, the two output values provided by the μC Expert when classifying
the $i$-th microcalcification for the malignant and the benign class. In order to obtain two
overall confidence degrees whose values are homogeneous with the outputs of the Cluster
Expert (which range from 0 to 1), a first solution is to count the microcalcifications
classified as malignant and as benign and to normalize these counts with respect to
$N_{\mu C}$. In other words, if $M$ is the number of microcalcifications for which
$O_i^M > O_i^B$, the estimated confidence degrees $(K_M, K_B)$ are given by:

$$K_M = \frac{M}{N_{\mu C}}, \qquad K_B = \frac{N_{\mu C} - M}{N_{\mu C}}$$

Unfortunately, while this estimate is very reliable for high values of $N_{\mu C}$, it does not
work well when the cluster contains a small number of microcalcifications. For this
reason, we have also considered another estimate $(A_M, A_B)$, which provides the desired
confidence degrees by averaging the outputs produced by the μC Expert:

$$A_M = \frac{1}{N_{\mu C}} \sum_{i=1}^{N_{\mu C}} O_i^M, \qquad A_B = \frac{1}{N_{\mu C}} \sum_{i=1}^{N_{\mu C}} O_i^B$$

Note that for the proposed estimates we also find the same dependence on $N_{\mu C}$ that
we highlighted for the Cluster Expert. To overcome this problem, the most suitable
combining scheme is the “Weighted Voting” rule, according to which the “vote” (i.e.
the output) of each expert is weighted by the estimated significance associated with the
expert; all the votes are finally collected and the input sample is assigned to the class
for which the sum of the votes is the highest. However, in our case, the weights of the
different contributions coming from the cluster and from the microcalcifications
cannot be fixed, but must vary as a function of $N_{\mu C}$. As a result, we obtain the
two-stage combining scheme described in Fig. 2.
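
The two aggregation estimates can be sketched in a few lines of Python. The sketch assumes that the outputs of the μC Expert for one cluster are available as a list of (malignant, benign) pairs; the function names and the sample values are purely illustrative.

```python
def count_estimate(outputs):
    """Counting estimate: fraction of microcalcifications classified as
    malignant and as benign (the pair K in the text)."""
    n = len(outputs)
    m = sum(1 for o_mal, o_ben in outputs if o_mal > o_ben)
    return m / n, (n - m) / n


def average_estimate(outputs):
    """Averaging estimate: mean malignant and mean benign output
    (the pair A in the text)."""
    n = len(outputs)
    return (sum(o[0] for o in outputs) / n,
            sum(o[1] for o in outputs) / n)


# Hypothetical outputs for a cluster with four microcalcifications.
outputs = [(0.9, 0.1), (0.6, 0.4), (0.2, 0.8), (0.7, 0.3)]
print(count_estimate(outputs))
print(average_estimate(outputs))
```

With only four microcalcifications, one changed decision moves the counting estimate by 0.25, whereas the averaging estimate changes smoothly with the individual outputs; this is the small-cluster weakness of the counting estimate that the text describes.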




Fig. 2. The adopted two-stage combining scheme. 

The first stage (hereafter called μC aggregation) collects the classification decisions
coming from the μC Expert applied to all the microcalcifications of the input cluster
and evaluates the two pairs of confidence degrees $(A_M, A_B)$ and $(K_M, K_B)$. These
are then weighted by two weight functions $\alpha(N_{\mu C})$ and $\beta(N_{\mu C})$. Both
functions are linear in $N_{\mu C}$, but with different slopes: $\alpha(N_{\mu C})$ is
increasing while $\beta(N_{\mu C})$ is decreasing.

The weighted values coming from the μC aggregation are finally combined with
the outputs coming from the Cluster Expert, $(C_M, C_B)$, weighted by a constant $\gamma$.
In conclusion, the final confidence degrees, denoted $V_M$ and $V_B$, are obtained in the
following way:

$$V_M = \alpha(N_{\mu C}) \cdot A_M + \beta(N_{\mu C}) \cdot K_M + \gamma \cdot C_M$$
$$V_B = \alpha(N_{\mu C}) \cdot A_B + \beta(N_{\mu C}) \cdot K_B + \gamma \cdot C_B$$

The parameters of the weight functions and $\gamma$ are evaluated by means of an
optimization phase whose details are given in the following section. As a result, two
particular values of $N_{\mu C}$ are determined such that the highest weight is given to the
pair $(A_M, A_B)$ below the first value and to the pair $(K_M, K_B)$ above the second; thus,
in these two intervals, the contribution coming from the μC-Expert prevails over that of
the Cluster Expert. This is consistent with the experimental evidence that the cluster
features are not very reliable with a low number of microcalcifications, while, for clusters
with many microcalcifications, the combined outputs of the μC-Expert are more accurate.
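
The second stage can be illustrated with the following Python sketch of a weighted vote in which two of the weights are linear functions of the number of microcalcifications and the third is a constant. The slopes, intercepts and the constant used below are arbitrary placeholders chosen for the example; in the paper these parameters come from an optimization phase, and none of the values here should be read as the authors' settings.

```python
def linear_weight(slope, intercept):
    """Return a weight function that is linear in the number of microcalcifications."""
    return lambda n: slope * n + intercept


def combine(avg_pair, count_pair, cluster_pair, n, alpha, beta, gamma):
    """Weighted vote per class: alpha(n)*averaging + beta(n)*counting + gamma*cluster.

    Each *_pair is (malignant, benign); the result is the pair of final scores.
    """
    return tuple(alpha(n) * a + beta(n) * k + gamma * c
                 for a, k, c in zip(avg_pair, count_pair, cluster_pair))


def decide(final_pair):
    """Assign the cluster to the class with the highest final score."""
    return "malignant" if final_pair[0] > final_pair[1] else "benign"


# Placeholder parameters, for illustration only.
alpha = linear_weight(0.01, 0.2)
beta = linear_weight(-0.005, 0.8)
gamma = 0.5

v = combine((0.6, 0.4), (0.75, 0.25), (0.3, 0.7), n=10,
            alpha=alpha, beta=beta, gamma=gamma)
print(v, decide(v))
```

In this example the Cluster Expert alone would favour the benign class, but the two aggregated μC estimates outweigh it and the cluster is labelled malignant.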



4. Experimental Results 

For testing our approach we have used a public database (available at the site
http://figment.csee.usf.edu/) of 40 mammograms containing 72 malignant clusters
and 30 benign ones, with 1792 malignant and 331 benign microcalcifications. Images
were provided courtesy of the National Expert and Training Centre for Breast
Cancer Screening and the Department of Radiology at the University of Nijmegen,
the Netherlands. All images are 2048 by 2048 pixels and use 12 bits per pixel of gray-level
information. Some preprocessing was performed to convert the images to an 8 bit/pixel
format using adaptive noise equalisation.

The database has proved to be a severe test bed for our approach, since the
microcalcifications were typically very small. Moreover, the low
number of clusters makes it very difficult to train experts based on neural
classifiers. For this reason, we have adopted a leave-one-out approach for our
experiments. With this method, the classifiers were trained on a
training set containing 101 clusters, while the one remaining cluster was used as a test
sample. More precisely, for each trial, 90% of the training set was actually used
for training the neural networks, while the remaining 10% (randomly extracted) was employed
to estimate the optimal parameters for the weight functions $\alpha(N_{\mu C})$ and
$\beta(N_{\mu C})$. The classification is repeated in this way until all clusters have been
used once as a test sample.

In order to evaluate the diagnostic ability of the MCS in recognizing malignant
clusters, we have employed the Receiver Operating Characteristic curve (ROC curve).
This is a graph obtained by calculating the sensitivity and specificity for every
operating point and plotting sensitivity (estimated as the true positive rate, or TP rate,
i.e. the fraction of malignant cases that are classified as positive) against
(1 - specificity) (estimated as the false positive rate, or FP rate, i.e. the fraction of
benign cases that are classified as positive). A diagnostic system that
perfectly discriminates between the two classes would yield a curve that coincides
with the left and top sides of the plot. A test that is completely useless would give a
straight line from the bottom left corner to the top right corner. In practice, there is
always some overlap between the two classes, so the curve will lie between these extremes.
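
The construction of an empirical ROC curve can be sketched as follows. The sketch assumes that each cluster has a single malignancy score and a label (1 for malignant, 0 for benign) and that a cluster is called positive when its score reaches the threshold; the function name and the sample data are illustrative. It produces the raw operating points only and does not perform any curve fitting.

```python
def roc_points(scores, labels):
    """Return (FP rate, TP rate) pairs, one per distinct threshold.

    labels: 1 = malignant (positive), 0 = benign (negative).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is called positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


scores = [0.9, 0.8, 0.7, 0.4, 0.3]
labels = [1, 1, 0, 1, 0]
for fp_rate, tp_rate in roc_points(scores, labels):
    print(fp_rate, tp_rate)
```

The curve starts at the bottom left corner, ends at the top right corner, and moves up for each malignant case and right for each benign case that the lowered threshold admits.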

For evaluating the ROC curve of the MCS we have used the ROCKIT 0.9B package,
developed by Metz and his research group at the Department of Radiology of the
University of Chicago. This software is designed to fit binormal ROC curves to both
continuously-distributed and ordinal category diagnostic test results, on the basis of a
maximum-likelihood estimate.

Fig. 3. The ROC curves of the single experts and of the whole MCS.

The ROC curves obtained for the MCS and the single experts from the tests on the
data set described above are shown in Fig. 3. The MCS provides appreciably better
discriminative performance than the single experts: its ROC curve lies above and to
the left of the curves of the experts. Among them, the
Cluster Expert has the best performance, while the two aggregations realized on the
μC-Expert give quite similar results.

To obtain a quantitative evaluation of the performance of the MCS and its
component experts, we have also evaluated the area under the respective ROC curves.
This parameter provides a global assessment of the performance of the test
(sometimes called diagnostic accuracy) and is equal to the probability that a random
malignant sample has a higher value of the measurement (in our case, of the
malignant output of the classifier) than a random benign sample. More precisely, this
probability is 0.5 for an uninformative diagnostic system and 1.0 for a
perfectly discriminative one. In particular, the optimal parameters for
the weight functions have been estimated by maximizing this parameter.
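
The probabilistic reading of the area under the ROC curve leads directly to a simple empirical estimate, sketched below in Python. This is the nonparametric pairwise estimate (ties counted as one half), shown only to illustrate the definition; it is not the binormal maximum-likelihood fit performed by the ROCKIT package, and the function name and data are illustrative.

```python
def auc(scores, labels):
    """Fraction of (malignant, benign) pairs in which the malignant sample
    receives the higher score; ties count as one half."""
    malignant = [s for s, y in zip(scores, labels) if y == 1]
    benign = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if m > b else 0.5 if m == b else 0.0
               for m in malignant for b in benign)
    return wins / (len(malignant) * len(benign))


print(auc([0.9, 0.8, 0.7, 0.4, 0.3], [1, 1, 0, 1, 0]))  # partial overlap
print(auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))          # perfect separation
print(auc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))          # uninformative scores
```

In the first example five of the six malignant-benign pairs are ordered correctly, so the estimate is 5/6; the other two calls reproduce the limiting values 1.0 and 0.5 mentioned in the text.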

The obtained values are shown in Table 1. Here too, the MCS proves to be
better than its component experts. In particular, while there is some improvement
with respect to the Cluster Expert, the difference with respect to the μC-Expert is very
large. It is worth noting that, even though its performance is not very good, the
contribution of the μC-Expert is very helpful in enhancing the discriminative ability of
the MCS.



Table 1. The areas under the ROC curves presented in Fig. 3.

                           Cluster    μC-Expert     μC-Expert     MCS
                           Expert     (A_M, A_B)    (K_M, K_B)
Area under the ROC curve   0.71       0.67          0.57          0.79



5. Conclusions

In this paper we have presented a novel method for classifying clusters of 
microcalcifications, which is based on a Multiple Classifier approach in order to 
exploit the evidence coming both from single microcalcifications and from the cluster 
as a whole. An experimental analysis performed on a standard database has shown the 
effectiveness of the approach with respect to single classifier systems.



Abstract. A number of methods exist for determining the output of an aggregated
classifier. A common approach is to apply the majority-voting
scheme. If the performance of the classifiers can be ranked in some intelligent
way, the voting process can be modified by assigning individual weights to
each of the ensemble members. For some base classifiers, like decision trees, a
given node or leaf is activated if the input lies within a well-defined region in
input space. In other words, each leaf node can be considered as defining a
given feature in input space. In this paper, we present a method for adjusting the
voting process of an ensemble by assigning individual weights to this set of
features, implying that different nodes of the same decision tree can contribute
differently to the overall voting process. By using a randomised “look-up
technique” for the training examples, the weights used in the decision process are
determined using a perceptron-like learning rule. We present results obtained by
applying such a technique to bagged ensembles of C4.5 trees and to the so-called
PERT classifier, which is an ensemble of highly randomised decision
trees. The proposed technique is compared to the majority-voting scheme on a
number of data sets.



1 Introduction 

Combining multiple classifiers to obtain improved performance is by now a widely
used technique [1]. Using an ensemble classifier [2] it is possible to reduce the mean
error rate over that of the individual classifiers and often the ensemble outperforms 
even the strongest individual member of the ensemble. Ideally, one wants to combine 
independent classifiers, but as it is often impossible to obtain independent classifiers, 
the aim of ensemble training algorithms is to create classifiers with low inter- 
correlation and moderate individual strengths. 

During the last decade, a number of methods for growing ensemble classifiers have
been proposed; the most widely known are the techniques denoted as boosting [3], bagging
[4] and randomisation [5]. Both bagging and boosting perturb the learning process
from one ensemble member to the next by using individual training sets obtained by
resampling the original training set. In addition to this, boosting also assigns weights to
the individual classifiers. The randomisation techniques operate by having one or
more randomisation steps integrated into the model building process. For a general
discussion of these and other techniques, see e.g. [6,7].

The outputs of a given ensemble member are often scaled so that the values can be
considered as estimates of the posterior probabilities of a given class given the input
example and the trained model. Alternatively, one may only pay attention to the class
that obtains the largest output. Depending on these interpretations, a number of
schemes exist for combining the classifiers. If the errors made by individual
classifiers are to a large degree uncorrelated, product rules for combining the
probability estimates can be used. A more robust scheme is to have each classifier
vote on just one class and then apply a simple majority rule for deciding the winner.
The simple majority rule can be modified to put different weights on different
members, which of course implies that some sensible quality measure must be used to
rank the performance of the individual classifiers. Such decision rules are all fixed.
Alternatively, one can train a decision model by using all the outputs of the individual
classifiers as input to a classification model [8].

A theoretical analysis of the n-tuple classifier [9], which is an ensemble-based
classifier, revealed that for this classifier the majority-voting rule, which is
conventionally used, could often be a bad choice. As a result the decision border has
to be adapted to the specific set of ensemble members, and a method for obtaining an 
adequate decision border for a voting ensemble was recently derived [9,10]. The 
principle behind the modified decision scheme is to assign different weighting 
schemes (including biases) to each of the possible output scores. It turns out that in 
many situations (especially with highly randomised ensembles) a simple adaptation of 
the decision border in what we denote “score space” can improve the performance of 
the ensemble. Accordingly, an initial decision border (normally corresponding to the 
majority voting scheme) is adjusted by considering the individual score points as 
acting with forces on the border. In the present paper, we extend this idea. If it is 
beneficial to adjust the weightings of the class votes, then it may also be advantageous 
to adjust the influence of the features leading to the class-scores. Instead of having the 
score points act with forces on the decision border we use the opposite approach of 
having a given decision border act on the individual score points. In the case of 
decision trees, this implies that different leaves of a single tree will contribute 
differently to the overall voting process. Especially in cases where the ensemble 
building process is based on a large degree of randomisation, the quality of the 
different substructures of a decision tree with respect to separating the classes can 
vary quite a lot. Breiman has shown that the use of ensembles of randomised decision 
trees corresponds to performing a kernel operation over the input space [11]. The 
method presented below can be seen as adapting this kernel to the individual look-up 
paths, which the examples traverse on their way through the “decision forest”. 

Below we describe an algorithm that can be applied to modify the score values 
obtained for each example. In addition we present results obtained by applying the 
technique on a number of data sets using respectively a bagged ensemble of C4.5 
classifiers and the so-called PERT ensemble classifier (Perfect Random Tree) [12,13]. 



2 Non-optimality of the Majority Voting Scheme 

Consider a classification problem specified by a set of $m$ training
examples, $\{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$, where $x_i$ defines the input
vector and $y_i$ is the corresponding class label. The set of all input vectors $x_i$ is
denoted $X$. In the case of a voting scheme, a single classifier trained on $X$ performs a
mapping from the input space to a binary-valued output space, $I \to B$. Usually, the
output takes the form of a one-in-n encoding of the output class. By creating a voted
ensemble, the overall transformation can be seen as a transformation from input to output
via an intermediate score space, $I \to S \to B$. The class scores are simply the average (or
sum) of the output vectors obtained from the individual classifiers.
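
The mapping through score space can be sketched in Python as follows. The sketch assumes that each ensemble member simply returns a class index, which is then expanded into a one-in-n vector; the function names and the sample votes are illustrative.

```python
def one_in_n(label, n_classes):
    """One-in-n encoding of a class index."""
    return [1 if c == label else 0 for c in range(n_classes)]


def class_scores(votes, n_classes):
    """Point in score space: average of the one-in-n output vectors."""
    vectors = [one_in_n(v, n_classes) for v in votes]
    return [sum(column) / len(votes) for column in zip(*vectors)]


def majority_vote(votes, n_classes):
    """Majority-voting decision: the class with the highest score."""
    scores = class_scores(votes, n_classes)
    return max(range(n_classes), key=scores.__getitem__)


votes = [0, 2, 2, 1, 2]   # class indices returned by five ensemble members
print(class_scores(votes, 3))
print(majority_vote(votes, 3))
```

Here the five votes place the example at the point (0.2, 0.2, 0.6) in score space, and the majority rule assigns it to class 2.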

As mentioned in the introduction, the majority-voting scheme will not always be
an adequate choice for combining voting classifiers. This will especially be true if the 
training distribution is highly skewed over the classes or if the spread of the classes in 
input space is highly different. Correspondingly, a modification of the decision 
scheme is needed [10]. In [14] it is shown that the scheme developed for adjusting the 
decision border for the n-tuple classifier also can be useful for other types of 
ensemble based classifiers, especially highly randomised ones such as the PERT 
classifier [12,13]. In the next sections, we will briefly review the procedure described 
in [10]. 



2.1 Modified Decision Border 

The method used to adjust the decision border in score space [10] can be explained as 
follows. After training an ensemble of classifiers, the ensemble is used to classify a 
set of validation examples. Preferably, the scores are obtained using leave-one-out 
cross-validation, hence reducing the need for a separate validation set (it is simple to 
perform a leave-one-out classification when using the n-tuple classifier [15]). Figure 1 
illustrates a case where inspection of the distribution of score values obtained on a 
validation set reveals that the error rate could be lowered by using another decision 
border than that corresponding to the majority-vote decision. One way for handling 
this problem could be to train another classifier to separate the examples in score 
space, e.g. using a Support Vector Machine [16,17]. A simple approach is to restrict 
the decision border to a line, and then use a force-field analogy to adjust the line 
parameters [10]. This technique has been applied to the case shown in Figure 1. As 
illustrated in Figure 2 the gained performance on the validation set generalises well to 
the test set. 



2.2 Adaptive Feature Weighting 

For a decision tree, there will in general be several leaf nodes that all result in the 
same output. The “activation” of these leaf nodes, however, is caused by different 
input values that belong to different regions in the input space. The method of 
changing the decision border as outlined above treats all scores obtained on a given 
class equivalently - it is not possible to separate two examples having identical scores 
no matter how the border is modified. If we instead allow the individual score values 
to be modified it may be possible to solve such conflicts, and at the same time we 
could adapt to a given decision border. 
[Figure 1 plot. Panel title: Cross-validation. Horizontal axis: score on class 1.]

Fig. 1. Score values in score space between two of the three classes in the DNA data set. The
majority-voting scheme (WTA) is not optimal for separating the classes. A new border (LinMAP)
is made to optimise the performance on the validation set.



[Figure 2 plot. Panel title: Test. Horizontal axis: score on class 1.]

Fig. 2. Application of the initial and adjusted border depicted in Fig. 1 on the test set. The 
misclassification between the two classes shown decreases from 23% to 7% 

In order to be able to modify the score values we propose to assign a set of weights 
(one for each class) to each leaf node of all the decision trees that make up a given 
ensemble. The values in score-space are now defined, not as the average of the votes 
on the individual classes, but as the sum (or average) of the class weights of the 
activated leaf nodes. By adjusting the weights, it is therefore possible to change the 
score values resulting from a given input, effectively moving the examples to other 
locations in score-space. How should we then control these movements? We suggest 
using the opposite of the above-described scheme: Instead of changing the location of 
the decision border caused by forces from the examples, we look at the opposite 
forces. A force from the majority-vote decision line now influences each example 
causing the examples to move in score-space.

Let $w_{j,i}$ denote the weight vector of the $i$'th leaf node in the $j$'th tree. For short,
we denote the weight vector activated in the $j$'th tree by an example $x$ as
$w_{j,x}$. Each $w_{j,i}$ contains scalar weights $w_{j,i,c}$, one for each class $c$. When
classifying an example, the output of the ensemble is found as

$$\hat{y} = f(x) = \arg\max_c \sum_j w_{j,x,c} \qquad (1)$$



Each leaf is normally assigned a specific class label, given as the most frequent 
class within the training examples falling into the leaf. In the proposed scheme, the 
weights are initialised to zero with the exception that the weight corresponding to the 
class label of the leaf is assigned the value one. With this choice of initialised values, 
the initial output corresponds to a majority vote decision. 
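
The leaf-weight formulation and its initialisation can be sketched as follows. The data layout is an assumption made for this example: each tree is reduced to a list of leaf class labels, and an example is represented by the index of the leaf it activates in each tree. The function names are illustrative.

```python
def init_weights(leaf_labels, n_classes):
    """One weight vector per leaf: 1 for the leaf's own class label, 0 elsewhere.

    leaf_labels[j][i] is the class label of leaf i in tree j.
    """
    return [[[1.0 if c == label else 0.0 for c in range(n_classes)]
             for label in tree]
            for tree in leaf_labels]


def ensemble_output(weights, active_leaves):
    """Sum the weight vectors of the activated leaves and take the arg max.

    active_leaves[j] is the index of the leaf reached in tree j.
    Returns (predicted class, list of class scores).
    """
    n_classes = len(weights[0][0])
    scores = [sum(weights[j][leaf][c] for j, leaf in enumerate(active_leaves))
              for c in range(n_classes)]
    return max(range(n_classes), key=scores.__getitem__), scores


leaf_labels = [[0, 1], [1, 1, 0], [0, 1]]   # three small hypothetical trees
weights = init_weights(leaf_labels, n_classes=2)
active = [1, 0, 0]                          # leaves activated by one example
print(ensemble_output(weights, active))
```

With this initialisation the class scores are just vote counts (here one vote for class 0 and two for class 1), so the output coincides with the majority-vote decision, as stated in the text.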

Besides reducing the number of errors on a validation set, the adjustment scheme
should also attempt to ensure a certain margin between the score on the true class and
the closest competitor. One such scheme for adjusting the weights is listed in Figure
3. In order to adjust the weights, a set of examples $X_{Adj}$ with known class labels is
needed. However, in order to model the score values that will be obtained
on examples not used for building the decision trees, it is important that the set
$X_{Adj}$ somehow deviates from the training data set. Still, it might be a good idea for
$X_{Adj}$ to also contain the examples used for training the trees. One possibility for
having separate examples reserved for the weight adjustment procedure is to reserve a part of
the training examples for this adjustment set. However, it would often be desirable if
all available training examples could be used for building the individual classifiers.
This is possible if the examples used for adjustment are obtained by adding small but
varying amounts of noise to the training examples (i.e. training with jitter). We have
tried both this concept and the use of a separate part of the training set for
adjustment. The noise we use is additive Gaussian noise. The standard deviation of
the noise is chosen as a fraction (between one and ten percent) of the standard deviation
calculated for the input variables in the total training set.
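
Generating the jittered adjustment examples can be sketched as follows. The sketch assumes numeric input vectors stored as lists, uses the population standard deviation of each input variable, and takes a seeded random generator as a parameter so that runs are repeatable; these choices and the function name are assumptions of the example.

```python
import random
from statistics import pstdev


def jitter(examples, noise_level, rng):
    """Add zero-mean Gaussian noise to every input variable.

    The noise standard deviation for a variable is noise_level times the
    standard deviation of that variable over all examples.
    """
    stds = [pstdev(column) for column in zip(*examples)]
    return [[value + rng.gauss(0.0, noise_level * std)
             for value, std in zip(example, stds)]
            for example in examples]


train = [[1.0, 10.0], [2.0, 30.0], [3.0, 50.0]]
noisy = jitter(train, noise_level=0.05, rng=random.Random(0))
print(noisy)
```

Because the noise is scaled per variable, the second input, which varies twenty times more than the first, also receives proportionally larger perturbations. Calling the function again in each adjustment iteration yields the "varying amounts of noise" mentioned above.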

When applying the weight adjustment scheme the ensemble of decision trees can 
be interpreted as an advanced feature extractor, see Figs. 4 and 5. Training the 
weights then corresponds to detecting and weighting the combinations that in an 
adequate way can discriminate between the classes in question. 

The suggested method is applicable for ensembles of any classifier where the 
output can be related to a specific activation pattern. In the following section, we 
present results of applying the weighting scheme to the PERT classifier and to bagged 
ensembles of C4.5 trees.

For all x_i ∈ X_Adj:
    ŷ_i = f(x_i) or ŷ_i = f(x_i + n)            // n is a noise vector
    c_r = argmax_{c ≠ y_i} Σ_j w_{j,x_i,c}       // Runner-up class
    // Adjust examples within a given margin t
    if Σ_j w_{j,x_i,y_i} − Σ_j w_{j,x_i,c_r} < t
        For all trees j
            // Increase weight on the true class
            w_{j,x_i,y_i} = w_{j,x_i,y_i} + Δw
            if ŷ_i ≠ y_i
                // Decrease weight on false winner
                w_{j,x_i,ŷ_i} = w_{j,x_i,ŷ_i} − Δw
            else
                // Decrease weight on runner-up class
                w_{j,x_i,c_r} = w_{j,x_i,c_r} − Δw
            end if
        next tree
    end if

Fig. 3. Adjustment scheme for determining the weights. The threshold t determines the margin
region in score space where examples will influence the weight adjustment.
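
One possible reading of the adjustment scheme of Fig. 3 is sketched below in Python. Several points are assumptions of this sketch rather than statements from the figure: the margin test is taken to be "score of the true class minus score of the best other class is below t", examples are given directly as (activated leaves, true label) pairs, and all names are illustrative.

```python
def init_weights(leaf_labels, n_classes):
    """Majority-vote initialisation: 1 on each leaf's own class, 0 elsewhere."""
    return [[[1.0 if c == label else 0.0 for c in range(n_classes)]
             for label in tree] for tree in leaf_labels]


def scores(weights, active, n_classes):
    """Class scores: summed weights of the leaves activated in each tree."""
    return [sum(weights[j][leaf][c] for j, leaf in enumerate(active))
            for c in range(n_classes)]


def adjust(weights, examples, n_classes, t, dw, iterations):
    """Perceptron-like adjustment of the leaf weights (modifies weights in place).

    examples: list of (active_leaves, true_label) pairs.
    """
    for _ in range(iterations):
        for active, y in examples:
            s = scores(weights, active, n_classes)
            y_hat = max(range(n_classes), key=s.__getitem__)
            runner_up = max((c for c in range(n_classes) if c != y),
                            key=s.__getitem__)
            if s[y] - s[runner_up] < t:          # example lies inside the margin
                loser = y_hat if y_hat != y else runner_up
                for j, leaf in enumerate(active):
                    weights[j][leaf][y] += dw      # strengthen the true class
                    weights[j][leaf][loser] -= dw  # weaken false winner / runner-up
    return weights


weights = init_weights([[0, 1], [1, 1, 0], [0, 1]], n_classes=2)
examples = [([1, 0, 0], 0)]   # majority vote says class 1, true class is 0
adjust(weights, examples, n_classes=2, t=2.1, dw=0.1, iterations=15)
print(scores(weights, [1, 0, 0], 2))
```

In this toy run the example is initially misclassified; the updates first flip the decision and then keep pushing the example away from the border until its margin exceeds the threshold, after which it no longer influences the weights. Leaves that the example does not activate keep their initial weights.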







Fig. 4. Ensemble of four decision trees. The white and black nodes are inactive and active leaf 
nodes respectively 



Fig. 5. Binary activation pattern corresponding to the ensemble depicted in Fig. 4.



3 Results

In order to investigate the proposed weighting scheme, we have implemented and 
tested the algorithm on two types of ensemble based classifiers. The classifiers have 
then been applied to a number of data sets from the UCI repository [18], the Statlog 
project [19] and an artificial data set described by Breiman [4]. 

Table 1 lists the preliminary results of the proposed scheme. The “voted” column 
corresponds to the error-rate of the unweighted ensemble, other results are given after 
15 iterations. A noise level of 0.01 specifies that the noise contribution to a given 
input variable is drawn from a normal distribution having a standard deviation being 
100 times smaller than the standard deviation measured for the input variable in 
question. For the runs having a noise level of zero, only 70% of the training examples 
were used to build the trees, but all training examples were used to adjust the weights. 
All results are obtained as the average error rate over 10 runs. For each run, the data 
sets are randomly split into training and test sets, the sizes of which are listed in Table 
1. For C4.5, bagging is used to create the ensembles. In the case of PERT the 
randomisation takes place within the model construction itself, and it is therefore not 
necessary to apply a technique like bagging. An ensemble size of 100 trees is used for 
all experiments, but it should be noted that for several of the data sets the 
performances could be improved by using more classifiers in the ensembles. The 
threshold level determining the desired margin (see Figure 3) is set to 70% of the
maximum number of votes that can be obtained using the majority rule. The step size
$\Delta w$ was 0.1 (see Figure 3).

It can be observed that the weight adjustment scheme in the case of PERT leads to
a substantial performance improvement on the Cut20, DNA, BelgianII and Ringnorm
data sets. Smaller improvements are obtained on the BelgianI data set as well as on
the Sonar and the Vowel data sets. For C4.5, improvements can be observed on
BelgianII, Cut20, Image, Ringnorm, Vehicle and Vowel. The very large improvements
on PERT take place on exactly those data sets where PERT performs poorly
compared to the C4.5 ensemble. It is also noted that the strategy of setting aside part
of the training examples when training the trees and not using jitter is
sometimes better than using all examples for building the trees.

Figs. 6 and 7 illustrate how the error rates evolve over the adjustment iterations.
On the Liver data set, the weight adjustment leads to overfitting, which might be
caused by having too few examples in the training set. For the case shown in Fig. 6
the overfitting can actually be revealed during the training phase by using an artificial
validation set, obtained by adding noise to the training examples. Unfortunately, this
approach for detecting overfitting does not always work.

Table 1. Test error rates of ensemble based classifiers on a number of problems

                       PERT                                C4.5
                              weighted, noise level               weighted, noise level
Data set   Train/Test  Voted  0      0.01   0.05   0.10   Voted  0      0.01   0.05   0.10
BelgianI   1250/1250   3.5    2.9    3.1    3.1    3.0    2.5    2.6    2.5    2.4    2.3
                       ±0.5   ±0.5   ±0.4   ±0.4   ±0.4   ±0.5   ±0.4   ±0.4   ±0.4   ±0.6
BelgianII  2000/       7.3    4.3    4.7    4.3    4.3    1.9    1.7    1.6    1.7    1.6
                       ±0.5   ±0.7   ±0.8   ±0.8   ±0.9   ±0.4   ±0.5   ±0.5   ±0.5   ±0.4
Cut20      11220/7480  3.8    2.6    2.8    2.5    2.8    2.7    2.5    2.4    2.4    2.4
                       ±0.2   ±0.2   ±0.1   ±0.1   ±0.1   ±0.2   ±0.2   ±0.2   ±0.2   ±0.2
DNA        2000/1187   27.5   10.4   10.9   10.8   10.5   6.0    5.5    5.8    5.8    5.8
                       ±2.1   ±1.0   ±1.3   ±0.9   ±1.0   ±1.0   ±0.8   ±0.8   ±0.8   ±0.8
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2079/ 


2.5 
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1.8 
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2.8 


2.4 
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2.1 
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±0.8 


±0.7 


±0.6 


±0.8 


±1.1 


±1.0 


±1.2 


±1.2 


±1.0 


±0.9 
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315/ 


7.6 


3.9 


6.1 


7.5 


4.7 


7.2 


7.2 


7.8 


6.7 


7.2 


16 


±3.9 


±3.0 


±3.7 


±3.7 


±3.5 


±4.0 


±3.8 


±3.4 


±3.5 


±3.5 
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25.6 


31.4 


31.4 


30.6 


32.3 


27.9 


30.6 


30.3 


28 


28.3 


35 


±6.2 


±10.2 


±6.2 


±7.5 


±6.3 


±7.8 


±10.2 


±7.6 


±6.4 


±7.1 
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300/ 


11.1 


5.8 
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6.0 


7.1 


9.5 


7.5 


7.8 


7.3 


7.1 
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±1.9 


±0.4 


±0.6 


±0.6 


±0.6 


±1.8 


±1.2 


±1.8 


±1.8 


±1.8 
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11.4 
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21 
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±8.7 


±8.7 
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27.1 


28.5 


25.9 


24.5 
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22.5 


20.8 


85 
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±4.3 


±3.2 


±4.2 


±4.9 


±4.5 


±3.7 


±0.5 


±4.4 


Vowel 


891/ 


2.2 


1.3 


1.4 


1.4 


2.0 


8.3 


8.1 


4.9 


4.3 


4.4 


99 


±1.6 
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±1.4 


±1.4 


±1.0 


±3.0 
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±2.4 


±1.8 


±2.2 
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391/ 


3.6 


4.1 


3.4 


3.2 


3.2 


5.0 


5.5 


5.7 


5.7 


5.7 


44 


±2.7 


±3.5 
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±2.6 
[Figure: error rate as a function of weight-adjusting iterations (x-axis: iteration); Liver, PERT]

Fig. 6. Error rates versus number of iterations for the PERT ensemble applied to the
Liver data set. On this problem, the method leads to overfitting
[Figure: error rate as a function of weight-adjusting iterations (x-axis: iteration); DNA, PERT]

Fig. 7. Error rates versus number of iterations for the PERT classifier applied to the DNA data
set



4 Discussion 

In the present paper, we have investigated the effect of introducing individual 
weighting of the features specified by the leaves of a decision tree. Introducing such 
weights will influence the decision process of the whole ensemble in a local manner. 
The potential benefit from a theoretical point of view is the possibility of increasing 
the modelling capability of ensembles. A drawback is the fact that the individual 
weights have to be estimated partly from examples that differ from the examples that 
are used when building the individual classifiers. The approach taken here for dealing 
with this problem is to extract further examples from the training set by applying 
jitter. For adjusting the weights, we have applied a perceptron learning rule. In the 
present study, we have simply used a specific number of adjustment iterations, but 
due to the large number of weights we are faced with the risk of overfitting. Early 
stopping is one (too) simple way for handling this problem. 

The preliminary results presented in this work show that under certain
circumstances it can be highly beneficial to use the suggested weighting scheme,
while in other cases it might lead to overfitting. Overfitting is more likely to happen
for the small data sets. Future work should investigate methods for finding the proper
amount of noise to add to the training examples and look for improved ways of
adjusting the weight values.
Abstract. A classifier team is used in preference to a single classifier
in the expectation it will be more accurate. Here we study the poten- 
tial for improvement in classifier teams designed by the feature subspace 
method: the set of features is partitioned and each subset is used by 
one classifier in the team. All partitions of a set of 10 features into 3 
subsets containing (4, 4, 2) features and (4, 3, 3) features, are enumerated 
and nine combination schemes are applied on the three classifiers. We 
look at the distribution and the extremes of the improvement (or fail- 
ure); the chances of the team outperforming the single best classifier if 
the feature space is partitioned at random; the relationship between the 
spread of the individual classifier accuracy and the team accuracy; and 
the performance of the combination schemes.



1 Introduction 

We examine by an enumerative experiment what the support is for the intuition
that a team of classifiers performs better than the single best classifier in the
team. The feature subspace method has been used: we partition the set of features
into subsets where each subset is used by one classifier in the team. Using
different feature subsets has been recognized as a promising team design method,
especially in text recognition [...] and speech recognition [...]. Kittler et al. [...]
derive a series of theoretical results based on the assumption that the individual
classifiers use conditionally independent subsets of features. Sometimes
the features are naturally grouped and this suggests which of them should be
used together. For example, Duin and coauthors [...] (and earlier [...]) study classifier
fusion methods for recognizing handwritten numerals by using 6 types of
different feature sets: Fourier coefficients, profile correlation, Karhunen-Loeve
coefficients, pixel averages in 2x3 windows, Zernike moments and morphological
features. Random sampling from the feature set for designing the individual
classifiers has been studied in [...]. A genetic algorithm for partitioning the
feature space is proposed in [...].

Here we offer an exhaustive experimental study with $L = 3$ classifiers and
a data set with $n = 10$ features, enumerating all partitions of the feature set
into (4,4,2) and (4,3,3) features. Let $P_t$ be the accuracy of the team, $P_b$ be the
best (maximal) individual accuracy, and $P_w$ be the worst (minimal) individual
accuracy.

We seek answers to the following questions:

1. How is $P_t - P_b$ distributed and what are the maximal and the minimal
possible values for different combination schemes?

2. How likely is an improvement ($P_t - P_b > 0$) if we pick a random partition
of the set of features?

3. Is the team accuracy $P_t$ related to the range $P_b - P_w$ of individual accuracies?

4. How do the combination schemes compare with respect to the answers to
the previous three questions?

Section 2 details the combination methods used, so that they can be reproduced
from the text. Section 3 contains the results of our experiment and the conclusion
section offers the answers to the above questions.

2 Combination Methods 

Let $\mathcal{D} = \{D_1, D_2, \ldots, D_L\}$ be a set of classifiers and $\Omega = \{\omega_1, \ldots, \omega_c\}$ be a
set of class labels. Each classifier gets as its input a feature vector $x \in \Re^n$.
The classifier output is a $c$-dimensional vector $D_i(x) = [d_{i,1}(x), \ldots, d_{i,c}(x)]^T$,
where $d_{i,j}(x)$ is the degree of “support” given by classifier $D_i$ to the hypothesis
that $x$ comes from class $\omega_j$, $j = 1, \ldots, c$. Without loss of generality we can
restrict $d_{i,j}(x)$ within the interval $[0, 1]$, $i = 1, \ldots, L$, $j = 1, \ldots, c$, and call the
classifier outputs “soft labels”. Most often $d_{i,j}(x)$ is an estimate of the posterior
probability $P(\omega_j | x)$.

Combining classifiers means we combine the $L$ classifier outputs $D_1(x), \ldots, D_L(x)$
to get a soft label for $x$, denoted $D(x) = [\mu_1(x), \ldots, \mu_c(x)]^T$.

If a crisp class label of $x$ is needed, we can use the maximum membership
rule: Assign $x$ to class $\omega_s$ iff

$$d_{i,s}(x) \geq d_{i,j}(x), \quad \forall j = 1, \ldots, c, \quad \text{for individual crisp labels;}$$
$$\mu_s(x) \geq \mu_t(x), \quad \forall t = 1, \ldots, c, \quad \text{for the final crisp label.} \qquad (1)$$

Ties are resolved arbitrarily. The minimum-error classifier is recovered from (1)
when $\mu_i(x) = P(\omega_i | x)$.

2.1 Majority Vote, Maximum, Minimum, Average, Product 

For the majority vote combination (MAJ), the class label assigned to $x$ is the
one that is most represented in the set of $L$ class labels $D_1(x), \ldots, D_L(x)$. For
the remaining simple combination methods,

$$\mu_j(x) = \mathcal{O}(d_{1,j}(x), \ldots, d_{L,j}(x)), \quad j = 1, \ldots, c, \qquad (2)$$

where $\mathcal{O}$ is the respective operation (maximum (MAX), minimum (MIN), average
(AVR) or product (PRO)).
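
The five simple rules can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the names `combine` and `crisp_label` are hypothetical, the soft outputs are passed as a list of rows (one row per classifier), and ties are broken in favour of the lowest class index, which is one arbitrary choice among many.

```python
from collections import Counter
from math import prod


def crisp_label(support):
    """Maximum membership rule: index of the largest support (ties -> lowest index)."""
    return max(range(len(support)), key=support.__getitem__)


def combine(profile, rule):
    """Combine soft labels; profile[i][j] is d_{i,j}(x) for classifier i and class j."""
    n_classifiers, n_classes = len(profile), len(profile[0])
    if rule == "MAJ":
        # Each classifier casts one vote for its own crisp label.
        votes = Counter(crisp_label(row) for row in profile)
        return [votes.get(j, 0) / n_classifiers for j in range(n_classes)]
    operations = {
        "MAX": max,
        "MIN": min,
        "AVR": lambda values: sum(values) / len(values),
        "PRO": prod,
    }
    # Equation (2): apply the operation to the j-th column of the outputs.
    columns = [[row[j] for row in profile] for j in range(n_classes)]
    return [operations[rule](column) for column in columns]


# Three classifiers, two classes (made-up numbers for illustration only).
profile = [[0.60, 0.40], [0.30, 0.70], [0.55, 0.45]]
for rule in ("MAJ", "MAX", "MIN", "AVR", "PRO"):
    print(rule, crisp_label(combine(profile, rule)))
```

On this made-up input the majority vote picks the first class, while the other four rules pick the second, which shows that the schemes can disagree on the same classifier outputs.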
2.2 Naive Bayes (NB)

This scheme assumes that the classifiers are mutually independent (this is the
reason we use the name “naive”); Xu et al. [...] and others call it Bayes combination.
For each classifier $D_i$, a $c \times c$ confusion matrix $CM^i$ is calculated by
applying $D_i$ to the training data set. The $(k, s)$th entry of this matrix, $cm^i_{k,s}$, is
the number of elements of the data set whose true class label was $\omega_k$ and which were
assigned by $D_i$ to class $\omega_s$. By $cm^i_{\cdot,s}$ we denote the total number of elements
labeled by $D_i$ into class $\omega_s$ (the sum of the $s$th column of $CM^i$). Using $cm^i_{\cdot,s}$, a
$c \times c$ label matrix $LM^i$ is computed, whose $(k, s)$th entry $lm^i_{k,s}$ is an estimate of
the probability that the true label is $\omega_k$ given that $D_i$ assigns crisp class label $\omega_s$:

$$lm^i_{k,s} = \hat{P}(\omega_k \mid D_i(x) = \omega_s) = \frac{cm^i_{k,s}}{cm^i_{\cdot,s}}. \qquad (3)$$

Considering the label matrix for $D_i$, $LM^i$, associated with $\omega_s$ is a soft label
vector $[\hat{P}(\omega_1 \mid D_i(x) = \omega_s), \ldots, \hat{P}(\omega_c \mid D_i(x) = \omega_s)]^T$, which is the $s$th column of the
matrix. Let $s_1, \ldots, s_L$ be the crisp class labels assigned to $x$ by classifiers
$D_1, \ldots, D_L$, respectively. Then, by the independence assumption, the estimate
of the probability that the true class label is $\omega_j$ is calculated by

$$\mu_j(x) = \prod_{i=1}^{L} \hat{P}(\omega_j \mid D_i(x) = s_i) = \prod_{i=1}^{L} lm^i_{j,s_i}, \quad j = 1, \ldots, c. \qquad (4)$$
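
As a worked illustration of equations (3) and (4), the sketch below builds the label matrices from crisp training labels and multiplies the relevant columns. The function names `label_matrix` and `naive_bayes_support` are hypothetical, classes are encoded as integers 0, ..., c-1, and the uniform fill for a label that a classifier never produced on the training data is an assumption of this sketch (the text does not say how that case is handled).

```python
def label_matrix(true_labels, assigned_labels, n_classes):
    """Return LM with LM[k][s] = cm[k][s] / (sum of column s), as in equation (3)."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for k, s in zip(true_labels, assigned_labels):
        cm[k][s] += 1
    column_totals = [sum(cm[k][s] for k in range(n_classes)) for s in range(n_classes)]
    return [
        [
            # Assumption: a never-assigned label gives a uniform column.
            cm[k][s] / column_totals[s] if column_totals[s] else 1.0 / n_classes
            for s in range(n_classes)
        ]
        for k in range(n_classes)
    ]


def naive_bayes_support(label_matrices, crisp_labels):
    """Equation (4): mu_j(x) is the product over classifiers of LM^i[j][s_i]."""
    n_classes = len(label_matrices[0])
    support = [1.0] * n_classes
    for lm, s in zip(label_matrices, crisp_labels):
        for j in range(n_classes):
            support[j] *= lm[j][s]
    return support


# Made-up training labels for two classifiers on a two-class problem.
truth = [0, 0, 0, 1, 1, 1]
lm_1 = label_matrix(truth, [0, 0, 1, 1, 1, 1], n_classes=2)
lm_2 = label_matrix(truth, [0, 1, 0, 1, 0, 1], n_classes=2)
print(naive_bayes_support([lm_1, lm_2], crisp_labels=[1, 0]))
```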

2.3 Behavior-Knowledge Space (BKS) 

Let again $(s_1, \ldots, s_L) \in \Omega^L$ be the crisp class labels assigned to $x$ by classifiers
$D_1, \ldots, D_L$, respectively. Every possible combination of class labels is an
index regarded as a cell in a look-up table (BKS table) [...]. The table is designed
using a labeled data set $Z$. Each $z_j \in Z$ is placed in the cell indexed by
$D_1(z_j), \ldots, D_L(z_j)$. The number of elements in each cell is tallied and the most
representative class label is selected for this cell. Ties are resolved arbitrarily and
the empty cells are labeled appropriately (e.g., at random or by majority, if applicable).
After the table has been designed, the BKS method labels an $x \in \Re^n$
to the class of the cell indexed by $D_1(x), \ldots, D_L(x)$.
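
The look-up table can be sketched with a dictionary keyed by the tuple of crisp labels. This is an illustrative sketch only: `build_bks_table`, `bks_predict` and `majority` are hypothetical names, and falling back to the majority vote for empty cells is just one of the options mentioned above.

```python
from collections import Counter, defaultdict


def build_bks_table(label_tuples, true_labels):
    """Map each observed tuple (s_1, ..., s_L) to the most frequent true label in its cell."""
    cells = defaultdict(Counter)
    for labels, truth in zip(label_tuples, true_labels):
        cells[tuple(labels)][truth] += 1
    # most_common(1) breaks ties by first occurrence, i.e. arbitrarily.
    return {key: counts.most_common(1)[0][0] for key, counts in cells.items()}


def majority(labels):
    """Plain majority vote over the crisp labels."""
    return Counter(labels).most_common(1)[0][0]


def bks_predict(table, labels):
    """Label x by its cell; empty cells fall back to majority vote (an assumption)."""
    key = tuple(labels)
    if key in table:
        return table[key]
    return majority(labels)


# Made-up outputs of L = 3 classifiers on five labeled objects.
outputs = [(0, 0, 1), (0, 0, 1), (0, 0, 1), (1, 1, 0), (1, 0, 1)]
truth = [1, 1, 0, 1, 0]
table = build_bks_table(outputs, truth)
print(bks_predict(table, (0, 0, 1)))  # cell label learned from the data
print(bks_predict(table, (1, 1, 1)))  # unseen cell, majority fallback
```

Note that for the cell (0, 0, 1) the table overrides the majority vote, which is exactly what lets BKS exploit systematic classifier errors and also what makes it prone to overtraining on small cells.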

2.4 Wernecke’s Method (WER) 

The model is similar to the BKS. The difference is that in constructing the table,
Wernecke [...] considers the 95 % confidence intervals of the frequencies in each
cell. If there is overlap between the intervals, the $L$ confusion matrices are used
to identify the “least wrong” classifier among the $L$ members of the team. First,
$L$ estimates of the probability $P(\text{error and } D_i(x) = s_i)$ are calculated. Then
the classifier with the smallest probability is nominated for labeling the cell. For
an $x \in \Re^n$, the cell is identified by the labels assigned by $D_1, \ldots, D_L$ and then
either the cell label is recovered or the label of the nominated classifier is taken
as the label of $x$.



2.5 Decision Templates (DT) 



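The classifier outputs can be conveniently organized in a decision profile as
the following matrix

$$DP(x) = \begin{bmatrix}
d_{1,1}(x) & \cdots & d_{1,j}(x) & \cdots & d_{1,c}(x) \\
\vdots & & \vdots & & \vdots \\
d_{i,1}(x) & \cdots & d_{i,j}(x) & \cdots & d_{i,c}(x) \\
\vdots & & \vdots & & \vdots \\
d_{L,1}(x) & \cdots & d_{L,j}(x) & \cdots & d_{L,c}(x)
\end{bmatrix} \qquad (5)$$

Using decision templates (DT) for combining classifiers is proposed in [...].
Given $L$ (trained) classifiers in $\mathcal{D}$, $c$ decision templates are calculated from the
data, one per class:

$$DT_i = \frac{1}{N_i} \sum_{z_j \in \omega_i,\; z_j \in Z} DP(z_j), \quad i = 1, \ldots, c, \qquad (6)$$

where $N_i$ is the number of elements of $Z$ from class $\omega_i$.
$DT_i$ can be regarded as the expected $DP(x)$ for class $\omega_i$. The support for the
class offered by the combination of the $L$ classifiers, $\mu_i(x)$, is then found using a
measure of similarity between the current $DP(x)$ and $DT_i$, e.g.,

$$d_E(DP(x), DT_i) = \sum_{j=1}^{c} \sum_{k=1}^{L} \left( d_{k,j}(x) - dt_i(k, j) \right)^2, \qquad (7)$$

where $dt_i(k, j)$ is the $(k, j)$th entry in decision template $DT_i$. Here we use the (squared)
Euclidean distance for calculating the similarity but other measures can also be
applied.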
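
Equations (6) and (7) can be sketched as follows. The names `decision_templates`, `squared_distance` and `dt_classify` are hypothetical; the sketch assumes that every class occurs at least once in the labeled set and returns the class whose template is nearest, which is equivalent to picking the largest support when support decreases with distance.

```python
def decision_templates(profiles, labels, n_classes):
    """Equation (6): DT_i is the mean decision profile of the objects from class i."""
    n_rows, n_cols = len(profiles[0]), len(profiles[0][0])
    sums = [[[0.0] * n_cols for _ in range(n_rows)] for _ in range(n_classes)]
    counts = [0] * n_classes
    for profile, label in zip(profiles, labels):
        counts[label] += 1
        for k in range(n_rows):
            for j in range(n_cols):
                sums[label][k][j] += profile[k][j]
    # Assumption: counts[i] > 0 for every class i.
    return [
        [[value / counts[i] for value in row] for row in sums[i]]
        for i in range(n_classes)
    ]


def squared_distance(profile, template):
    """Equation (7): sum of squared entry-wise differences."""
    return sum(
        (p - t) ** 2
        for p_row, t_row in zip(profile, template)
        for p, t in zip(p_row, t_row)
    )


def dt_classify(profile, templates):
    """Return the index of the closest decision template (ties -> lowest index)."""
    distances = [squared_distance(profile, template) for template in templates]
    return min(range(len(distances)), key=distances.__getitem__)


# Made-up decision profiles for L = 2 classifiers and c = 2 classes.
train_profiles = [
    [[0.9, 0.1], [0.8, 0.2]],
    [[0.7, 0.3], [0.6, 0.4]],
    [[0.2, 0.8], [0.3, 0.7]],
]
train_labels = [0, 0, 1]
templates = decision_templates(train_profiles, train_labels, n_classes=2)
print(dt_classify([[0.6, 0.4], [0.4, 0.6]], templates))
```

In this toy case the average rule would produce a tie (0.5 for each class), whereas the template comparison uses the whole profile and assigns the object to the first class.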

3 The Experiment 

We used the Wisconsin Diagnostic Breast Cancer database¹ taken from the
UCI Repository of Machine Learning Databases². The set consists of 569 patient
vectors with features computed from a digitized image of a fine needle aspirate
of a breast mass. They describe characteristics of the cell nuclei present in the
image. The objects are grouped into two classes: benign and malignant. Out of
the original 30 features we used the first 10; these were the means of the relevant
variables calculated in the image. The study was confined to 10 variables for
two reasons: to enable a reasonable enumerative experiment and to enhance
variability in classifier performance. The data set was split randomly into two
halves, one being used for training and one for testing.

¹ Created by Dr. William H. Wolberg, W. Nick Street and Olvi L. Mangasarian,
University of Wisconsin

² http://www.ics.uci.edu/~mlearn/MLRepository.html

We considered L = 3 classifiers. All partitions of the 10-element feature set 
into (4, 4, 2) (3150 partitions) and (4, 3, 3) (4200 partitions) were generated. For 
each partition, three classifiers were built, one on each subset of features. Two 
simple classifier models were tried: the linear and the quadratic classifier, leading 
to 4 sets of experiments: 

1. (4,4,2) with linear classifiers; 

2. (4,4,2) with quadratic classifiers; 

3. (4,3,3) with linear classifiers; 

4. (4, 3, 3) with quadratic classifiers. 
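
The partition counts quoted above can be checked by brute force. The sketch below (with the hypothetical name `ordered_splits`) enumerates splits in which the blocks are assigned to classifiers in a fixed order, so that swapping two equal-sized blocks counts as a different split; under that reading the counts equal the multinomial coefficients 10!/(4!4!2!) = 3150 and 10!/(4!3!3!) = 4200. Whether the original experiment counted splits in exactly this way is an assumption inferred from the reported numbers.

```python
from itertools import combinations


def ordered_splits(features, sizes):
    """Yield every way of dealing `features` into consecutive blocks of the given sizes."""
    if not sizes:
        yield ()
        return
    for block in combinations(features, sizes[0]):
        remaining = [f for f in features if f not in block]
        for tail in ordered_splits(remaining, sizes[1:]):
            yield (block,) + tail


features = list(range(10))
print(sum(1 for _ in ordered_splits(features, (4, 4, 2))))
print(sum(1 for _ in ordered_splits(features, (4, 3, 3))))
```

Each yielded tuple gives the three feature subsets on which the three classifiers of one team would be trained.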

To answer the four questions in the Introduction:

1. The minimal and the maximal values of the differences between the accuracy
of the team and the best individual accuracy ($P_t - P_b$) for the 9 combination
schemes are shown in Table 1. We denote by $P_{ia}$ the individual average of the
team. The bar above $P$ denotes the mean value over all generated teams for
the respective experiment. Example histograms of $P_{AVR} - P_b$ and $P_{AVR} - P_w$
are given in Figure 1.




Fig. 1. Histograms illustrating the distribution of the improvement for the Average 
aggregation method, experiment (4, 3, 3) and quadratic individual classifiers. 



2. Given in Table 1 are the fraction of cases when $P_t > P_b$ (the probability that
the team is better than the single best classifier) and also the fraction when
$P_t < P_w$ (the team is worse than the worst classifier). As an illustration,
Figure 2 (left) plots the accuracy of the Decision Template combination
$P_{DT}$ and the single best accuracy $P_b$ versus the (sorted by $P_{DT}$) number of
splits for experiment 3.

Table 1. Results from the 4 experiments: 2 partitions x 2 classifier models.
Columns: min and max of $P_t - P_b$ (in %); % of teams better than the single best;
% of teams worse than the single worst; correlation of $P_t$ with $P_b - P_w$ (in %).

(4,4,2), linear classifiers
$\bar{P}_w$ = 83.60 %, $\bar{P}_{ia}$ = 87.62 %, $\bar{P}_b$ = 90.43 %

Comb.   min        max        % better      % worse        Corr with
sch.    Pt - Pb    Pt - Pb    single best   single worst   Pb - Pw
MAJ     -3.51      2.81       32.13         0.13             3.46
NB      -3.51      3.16       42.03         0.06             8.42
BKS     -8.42      3.16       31.40         2.22            18.52
WER     -5.96      3.51       36.86         1.87            16.29
MAX     -4.56      3.51       37.21         0              -42.08
MIN     -4.56      3.51       37.21         0              -42.08
AVR     -4.21      3.51       46.67         0              -18.68
PRO     -4.21      2.81       44.38         0.13           -19.17
DT      -2.11      4.21       79.62         0               -0.54

(4,3,3), linear classifiers
$\bar{P}_w$ = 87.72 %, $\bar{P}_{ia}$ = 90.46 %, $\bar{P}_b$ = 92.42 %

Comb.   min        max        % better      % worse        Corr with
sch.    Pt - Pb    Pt - Pb    single best   single worst   Pb - Pw
MAJ     -3.86      2.46       41.10         0              -17.16
NB      -3.86      2.46       39.29         0              -19.06
BKS     -7.72      2.46       20.26         3.26           -12.80
WER     -7.02      2.46       19.17         2.88           -17.53
MAX     -3.16      3.51       59.86         0              -58.55
MIN     -3.16      3.51       59.86         0              -58.55
AVR     -2.81      3.51       68.29         0              -33.07
PRO     -2.81      3.16       64.81         0              -41.23
DT      -1.75      4.21       86.67         0              -45.18

(4,4,2), quadratic classifiers
$\bar{P}_w$ = 85.35 %, $\bar{P}_{ia}$ = 89.01 %, $\bar{P}_b$ = 91.55 %

Comb.   min        max        % better      % worse        Corr with
sch.    Pt - Pb    Pt - Pb    single best   single worst   Pb - Pw
MAJ     -3.86      2.46       18.67         0.25            -6.53
NB      -3.86      2.46       18.79         0.25            -2.51
BKS     -4.21      2.46       21.56         1.59             6.39
WER     -4.21      3.16       19.21         1.24             7.18
MAX     -3.16      2.81       29.71         0.25            -3.37
MIN     -3.16      2.81       29.71         0.25            -3.37
AVR     -3.16      2.81       27.87         0.06             2.18
PRO     -3.16      2.46       28.00         0.25             0.20
DT      -2.81      2.46       35.75         0.25             4.98

(4,3,3), quadratic classifiers
$\bar{P}_w$ = 88.43 %, $\bar{P}_{ia}$ = 91.17 %, $\bar{P}_b$ = 93.29 %

Comb.   min        max        % better      % worse        Corr with
sch.    Pt - Pb    Pt - Pb    single best   single worst   Pb - Pw
MAJ     -4.56      3.16       26.90         0.19           -18.31
NB      -4.56      3.16       26.90         0.19           -18.17
BKS     -5.96      3.51       24.24         3.55            -1.62
WER     -5.61      3.16       20.64         4.33            -0.52
MAX     -3.86      2.81       20.38         1.29            -9.21
MIN     -3.86      2.81       20.38         1.29            -9.21
AVR     -3.86      3.51       28.14         0.14            -2.51
PRO     -3.51      2.81       17.62         0.95            -1.16
DT      -3.86      3.16       23.48         0.43           -16.94




[Figure, left panel] Sorted $P_{DT}$ (the black line) and $P_b$ (the grey line) for experiment
(4, 3, 3) and linear individual classifiers. $P_{DT} > P_b$ in 86.7 % of the cases.

[Figure, right panel] Scatterplot of $P_{MAX}$ versus $P_b - P_w$ for (4, 3, 3) and linear individual
classifiers. The correlation between the two is -0.59.

Fig. 2. Illustration of the results



3. The last columns in the subtables in Table 1 show the correlation between
$P_b - P_w$ and $P_t$. Figure 2 (right) displays an example of the relationship
between $P_{DT}$ and $P_b - P_w$.

4. A two-way ANOVA was run to estimate whether there is a significant difference
between the 9 combination schemes. The test found significant differences
between the means of the team accuracies computed by the 9 schemes.
The means with the 95 % confidence intervals from the (4, 4, 2) experiments
are shown in Figure 3 and from the (4, 3, 3) experiments, in Figure 4.

4 Conclusions 

1. How is $P_t - P_b$ distributed and what are the maximal and the minimal possible
values for different combination schemes?

The difference between the team accuracy and the best individual accuracy shows a
stable pattern. In all experiments the best-case increase in accuracy is a few per cent. The
largest value of max($P_t - P_b$) in Table 1 is attained by the Decision Template combination
method, with 4.21 % for linear classifiers for both (4, 4, 2) and (4, 3, 3). All minimal
values of $P_t - P_b$ are negative, indicating that there is no combination scheme
(at least not among the studied ones) that guarantees improvement over the single
best classifier. The combination schemes with the worst negative results are
BKS and Wernecke’s method (up to -8.42 % for BKS). BKS is known
for being prone to overtraining, so the result is not surprising. In general, the
classification accuracy of the individual classifiers is around 90 %, so we cannot
expect substantial improvement from the combination. The differences have approximately
normal distributions for all combination schemes; a typical example
is shown in the left plot in Figure 1. Results worse than the single worst
classifier are unlikely, as shown in the penultimate columns in Table 1, which
have very small (often zero) values. However, the distribution of $P_t - P_w$ is not normal, as
is shown by the typical example in the right plot in Figure 1.

[Figure: two panels, "(4, 4, 2), linear" and "(4, 4, 2), quadratic", showing the mean team
accuracy and its 95 % confidence interval for each combination scheme]

Fig. 3. Means and the 95 % confidence intervals for experiment (4, 4, 2).

2. How likely is an improvement ($P_t - P_b > 0$) if we pick a random partition of
the set of features?

The numerical answers to this question are given in the fourth column of the
subtables in Table 1 for the experiments we carried out. However, we cannot
offer a clear-cut conclusion. A persistent pattern is that the percentage of teams
improving over the single best depends dramatically on the quality of
the individual classifiers. For the (weak) linear models, the improvement is more
often encountered, whereas for the quadratic models the chances of improvement
are halved. For example, the DT combination has a chance of about 85 % (Table 1,
subtable for (4, 3, 3) with linear classifiers) to improve on the single best linear classifier if the
feature set is split randomly into subsets of 4, 3, and 3 features. For the same
split, when quadratic classifiers are used, none of the schemes has more than
about 30 % chance for improvement (Table 1, subtable for (4, 3, 3) with quadratic classifiers).

[Figure: two panels, "(4, 3, 3), linear" and "(4, 3, 3), quadratic", showing the mean team
accuracy and its 95 % confidence interval for each combination scheme]

Fig. 4. Means and the 95 % confidence intervals for experiment (4, 3, 3).

Perhaps we can conjecture that this chance depends on the problem (how
complex it is), the classifier model (weak or strong), the number of features used
per classifier, and more factors which we did not examine here, e.g., the number
of classifiers L.

3. Is the team accuracy $P_t$ related to the range $P_b - P_w$ of individual accuracies?
The correlation coefficients and the scatterplot in Figure 2 (right) do not indicate
an unequivocal relationship. As the correlation coefficients tend to be small
in absolute value, there is little evidence that the more similar the individual
accuracies, the higher the improvement.

4. How do the combination schemes compare with respect to the answers to the 
previous three questions? 

Again, the combination schemes exhibit variable performance, and this shows 
that: (a) there is no “best” combination for all scenarios, and (b) building a clas- 
sifier team that outperforms the single best individual is a delicate job. Based 
on our results, we nominate the Decision Templates as the most successful com- 
bination scheme in our experiments. 
Abstract. Using an ensemble of classifiers instead of a single classifier
has been shown to improve generalization performance in many machine 
learning problems [4, 16]. However, the extent of such improvement de- 
pends greatly on the amount of correlation among the errors of the base 
classifiers [1,14]. As such, reducing those correlations while keeping the 
base classifiers’ performance levels high is a promising research topic. 
In this paper, we describe input decimation, a method that decouples 
the base classifiers by training them with different subsets of the input 
features. In past work [15], we showed the theoretical benefits of input 
decimation and presented its application to a handful of real data sets. 
In this paper, we provide a systematic study of input decimation on syn- 
thetic data sets and analyze how the interaction between correlation and 
performance in base classifiers affects ensemble performance. 



1 Introduction 

Using an ensemble of classifiers instead of a single classifier has been repeatedly 
shown to improve generalization performance in many machine learning prob- 
lems [4, 16]. It is well-known that, in order to obtain such improvement, one 
needs to simultaneously maintain a reasonable level of performance in the base 
classifiers that constitute the ensemble and reduce their correlations. There are 
many ensemble methods that actively promote diversity (i.e., lower correlations 
in the outputs) among their base classifiers. Bagging [4], boosting [7], and cross- 
validation partitioning [9, 14] generate diverse base classifiers by training with 
different subsets of the training set. Error-correcting output codes [5] generate 
new training sets with different class labels and use these different training sets 
to generate base classifiers. Merz [10] uses Principal Component Analysis [8] to
measure the correlations among the base models and combine them accordingly. 
Dietterich [6] combines decision trees in which each test is chosen at random 
among the 20 best tests. 
Most work in this field, however, focuses on pattern-level selection (e.g., Bagging,
Boosting). Input Decimation (ID), on the other hand, is a feature selection
method that generates different subsets of the input features for each of the
classifiers in the ensemble. By training each base classifier with a different feature
subset, the correlations among the base classifiers are reduced. (Note that
input decimation can be used in conjunction with pattern-based ensemble methods
such as bagging and boosting, as discussed in Section 3.) Input decimation
is different from most other dimensionality reduction methods that are widely
used, including PCA, in that it generates different feature subsets for different
classifiers. Furthermore, PCA aims to maximize the variability among the
newly constructed features, but makes no provisions on how that variability is
related to class information (see [11] for details).

In this work we explore using class information to reduce the dimensionality 
of the feature space presented to each base classifier. While strong ensemble 
performance was expected, input decimation also provided improvements in the 
base classifiers by pruning irrelevant features, thereby simplifying the learning 
problem faced by each base classifier. Consequently, Input Decimated Ensembles 
(IDEs) significantly outperformed both base classifiers trained on the full feature 
space as well as ensembles of such classifiers. In the next section we briefly 
highlight the need for correlation reduction in ensembles. We then present the 
input decimation algorithm, along with results on synthetic data sets. 

2 Correlation and Ensemble Performance 

In this article we focus on classifiers that model the a posteriori probabilities of
the output classes. Such algorithms include Bayesian methods [3], and properly
trained feed forward neural networks such as Multi-Layer Perceptrons (MLPs)
[12]. We can model the $i$th output of such a classifier as follows (details of this
derivation are in [13, 14]):

$$f_i(x) = P(C_i | x) + \eta_i(x),$$

where $P(C_i | x)$ is the posterior probability of the $i$th class given instance $x$, and
$\eta_i(x)$ is the error associated with the $i$th output. Given an input $x$, if we have
one classifier, we classify $x$ as being in the class $i$ whose value $f_i(x)$ is largest.

Instead, if we use an ensemble that calculates the arithmetic average over
the outputs of $N$ classifiers $f_i^m(x)$, $m \in \{1, \ldots, N\}$, then the estimate of $P(C_i | x)$ is given by:

$$f_i^{ave}(x) = \frac{1}{N} \sum_{m=1}^{N} f_i^m(x) = P(C_i | x) + \bar{\eta}_i(x), \qquad (1)$$

where:

$$\bar{\eta}_i(x) = \frac{1}{N} \sum_{m=1}^{N} \eta_i^m(x)$$

and $\eta_i^m(x)$ is the error associated with the $i$th output of the $m$th classifier.
Now, the variance of $\bar{\eta}_i(x)$ is given by [14]:

$$\sigma^2_{\bar{\eta}_i} = \frac{1}{N^2} \sum_{m=1}^{N} \sigma^2_{\eta_i^m(x)} + \frac{1}{N^2} \sum_{m=1}^{N} \sum_{l \neq m} cov(\eta_i^m(x), \eta_i^l(x)).$$

If we express the covariances in terms of the correlations ($cov(x, y) =
corr(x, y)\,\sigma_x \sigma_y$), assume the same variance $\sigma^2_{\eta_i(x)}$ across classifiers, and use the
average correlation factor among classifiers, $\delta_i$, given by

$$\delta_i = \frac{1}{N(N-1)} \sum_{m=1}^{N} \sum_{l \neq m} corr(\eta_i^m(x), \eta_i^l(x)), \qquad (2)$$

then the variance becomes:

$$\sigma^2_{\bar{\eta}_i} = \frac{1}{N} \sigma^2_{\eta_i(x)} + \frac{N-1}{N} \delta_i \sigma^2_{\eta_i(x)} = \frac{1 + \delta_i (N-1)}{N} \sigma^2_{\eta_i(x)}. \qquad (3)$$

Based on this variance, we can compute the variance of the decision boundary
and, generalizing this result to the classifier error, we obtain the relationship
between the model error (beyond the Bayes error) of the ensemble, $E^{ave}_{model}$, and
that of an individual classifier, $E_{model}$ [13, 14]:

$$E^{ave}_{model} = \frac{1 + \delta (N-1)}{N} E_{model}, \qquad (4)$$

where

$$\delta = \sum_{i=1}^{L} P_i \delta_i \qquad (5)$$

and $P_i$ is the prior probability of class $i$.

Equation 4 quantifies the connection between error reduction and the correlation
among the errors of the base classifiers. This result leads us to seek to reduce
the correlation among classifiers prior to using them in an ensemble. In the next
section we present the input decimation concept, which merges dimensionality
reduction and correlation reduction to provide classifier ensembles.
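
To make the role of the correlation concrete, the following small sketch evaluates the factor $(1 + \delta(N-1))/N$ from equation (4) and the prior-weighted average from equation (5). The function names and the numbers are illustrative assumptions, not values from the experiments.

```python
def average_correlation(priors, class_correlations):
    """Equation (5): delta is the prior-weighted sum of the per-class delta_i."""
    return sum(p * d for p, d in zip(priors, class_correlations))


def model_error_factor(delta, n_classifiers):
    """Equation (4): ratio of the ensemble model error to the individual model error."""
    return (1.0 + delta * (n_classifiers - 1)) / n_classifiers


# Uncorrelated errors give the full 1/N reduction; identical errors give none.
for delta in (0.0, 0.5, 1.0):
    print(delta, model_error_factor(delta, n_classifiers=4))
```

With four classifiers the factor is 0.25 for uncorrelated errors, 0.625 for an average correlation of 0.5, and 1.0 for perfectly correlated errors, which is why lowering the correlation matters as much as adding classifiers.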

3 The Input Decimated Ensembles 

Input decimation decouples the classifiers by exposing them to different aspects 
of the same data by selecting features most correlated with a particular class. 
ID trains L classifiers, one corresponding to each class in an L-class probleirQ. 
For each classifier, the method selects a user-determined number of the input 

^ More generally, one trains nL classifiers where n is an integer. 
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features having the hipest absolute correlation to the presence or absence of 
the corresponding clasfl The objective is to “weed” out input features that do 
not carry strong discriminating information for a particular class, and thereby 
reduce the dimensionality of the feature space to facilitate the learning process. 

Let the training set take the following form: 



{(xi,yi),(x 2 ,y 2 ),--.,(x m? ym)} ) 

where m is the number of training examples. Each Xi has HES'll elements (where 
FS is the set of input features) representing the values of the input features in 
example i. Each yi represents the class using a distributed encoding, i.e., it has 
L elements, where L is the number of classes, yu = 1 if example i is an instance 
of class I and yu = 0 if example i is not an instance of class 1. In this study our 
base classifiers consist of MLPs trained with the backpropagation algorithrr0. 

Given such a data set, and a base classifier learning algorithm, input deci- 
mated ensembles operate as follows: 

— For each class Z S {1, 2, . . . , L}, 

1. Compute the absolute value of the correlation between each feature j 
(xij for all patterns i) and the output for class I (yii for all patterns *). 

2. Select the ni features having the highest absolute correlation, resulting 
in new feature set FSi. One can either predetermine ni based on prior 
information about the data set, or learn the value to optimizes perfor- 
mance. 

3. Construct a new training set by retaining only those elements of the Xi’s 
corresponding to the features FSi and all the outputs. 

4. Call the base classifier learning algorithm on this new training set. Call 
the resulting classifier fK 

Given a new example x, we classify it as follows: 

— For each class $k \in \{1, 2, \ldots, L\}$, calculate $f_k^{ave}(x) = \frac{1}{L} \sum_{l=1}^{L} f_k^l(x)$ by
presenting the proper feature sets ($FS_l$) to each of the $L$ classifiers.

— Return the class $K = \arg\max_k f_k^{ave}(x)$.
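The training and classification steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the base learner here is a ridge regression onto the one-hot targets (an assumption standing in for the MLPs used in the paper), a single `n_keep` is used for every class, and all function names are hypothetical.

```python
import numpy as np

def select_features(X, Y, n_keep):
    """For each class l, return the indices of the n_keep features whose
    absolute correlation with the class indicator Y[:, l] is highest."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    cov = Xc.T @ Yc                                    # (n_features, L)
    norm = np.outer(np.linalg.norm(Xc, axis=0), np.linalg.norm(Yc, axis=0))
    corr = np.abs(cov / np.where(norm == 0, 1.0, norm))
    return [np.argsort(-corr[:, l])[:n_keep] for l in range(Y.shape[1])]

def fit_linear(X, Y, ridge=1e-3):
    """Stand-in base learner (assumption): ridge regression with a bias term."""
    A = np.hstack([X, np.ones((len(X), 1))])
    return np.linalg.solve(A.T @ A + ridge * np.eye(A.shape[1]), A.T @ Y)

def predict_linear(W, X):
    return np.hstack([X, np.ones((len(X), 1))]) @ W

def train_ide(X, Y, n_keep):
    """Train one base classifier per class, each on its own feature subset."""
    subsets = select_features(X, Y, n_keep)
    models = [fit_linear(X[:, fs], Y) for fs in subsets]
    return subsets, models

def classify_ide(subsets, models, X):
    """Average the L classifiers' outputs and return the arg-max class."""
    outputs = [predict_linear(W, X[:, fs]) for fs, W in zip(subsets, models)]
    return np.mean(outputs, axis=0).argmax(axis=1)
```

Each base classifier still predicts all $L$ outputs; only its inputs are restricted to the features most correlated with its own class.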

Fundamentally, input decimation seeks to reduce the correlations among in- 
dividual classifiers by using different subsets of input features, while methods 
such as bagging and boosting attempt to do so by choosing different subsets 
of training patterns. These facts imply that input decimation is orthogonal to 
pattern-based methods such as bagging and boosting, i.e., one can use input 
decimation in conjunction with pattern-based methods, and directly comparing 

^ Note that this method requires the problem to have at least three classes. In a 
two-class problem, features strongly correlated with one class will be strongly anti- 
correlated with the other class, so the same features would be chosen for both clas- 
sifiers. 

^ In principle, any learning algorithm that estimates the a posteriori class probabilities 
can be used. 
input decimation to bagging or boosting serves little purpose. Rather one should
compare input decimated ensembles to original ensembles (which is done here)
or input decimated, bagged ensembles to bagging alone (which we are currently
investigating).

4 Experimental Results 

In this section, we present the results of input decimation on synthetic datasets. 
As discussed above, our base classifiers are multi-layer perceptrons. In this work 
all such classifiers contain a single hidden layer, and the learning rate, momentum
term, and number of hidden units were experimentally determined.

As a standard against which to compare our input decimation results, we also 
trained a classifier on the full feature set (referred to as the “original single clas- 
sifier”) and separately trained L copies of the same classifier and incorporated 
them into an averaging ensemble (referred to as the “original ensemble”). As
anticipated, the original ensemble often performs significantly better than each of
its base classifiers. Comparing input-decimated ensembles with these original en- 
sembles isolates the benefits of removing input features from the base classifiers. 
Because PCA is a standard dimensionality reduction method, we also compare 
input decimated ensembles to PCA ensembles (i.e., ensembles where each con- 
stituent classifier was trained on a preselected set of the principal components 
of the feature space).

In these experiments, we used the following three synthetic datasets: 

— Set 1: 

• Three classes-one unimodal Gaussian per class. 

• 300 training patterns and 150 test patterns-100 training and 50 test 
patterns per class. 

• 100 features per pattern where there are: 

* 10 relevant features per class: each class’s instances are generated
from a multivariate normal distribution in 10 independent dimensions,
each distributed as $N(40, 5^2)$. There are no dimensions in common
among the three classes. Therefore, there are 30 relevant features.
For instances of each class, the 20 features that are relevant to the
other two classes are distributed as $U[-100, 100]$.
* 70 irrelevant features, distributed as $U[-100, 100]$.

— Set 2: Same as Set 1, except that only 50 irrelevant features were added to 
the 30 relevant features, for a total of 80 features in the dataset. 

— Set 3: Same as Set 1, except that there is overlap among the relevant features 
for each class (e.g., classes have three relevant features in common). 
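A generator for Sets 1 and 2 might look like the sketch below. It follows the description above, but the column layout (each class's relevant features in a contiguous block, irrelevant features last), the function name and the seed are assumptions made for illustration. Set 3 is not covered, since the exact overlap pattern is not specified.

```python
import numpy as np

def make_set1(n_per_class=100, n_classes=3, n_relevant=10, n_irrelevant=70, seed=0):
    """Synthetic data in the style of Set 1 (use n_irrelevant=50 for Set 2)."""
    rng = np.random.default_rng(seed)
    n_features = n_classes * n_relevant + n_irrelevant
    n = n_per_class * n_classes
    # Start with every feature drawn from U[-100, 100] ...
    X = rng.uniform(-100.0, 100.0, size=(n, n_features))
    labels = np.repeat(np.arange(n_classes), n_per_class)
    # ... then overwrite each class's own relevant block with N(40, 5^2).
    for c in range(n_classes):
        rows = labels == c
        cols = slice(c * n_relevant, (c + 1) * n_relevant)
        X[rows, cols] = rng.normal(40.0, 5.0, size=(n_per_class, n_relevant))
    Y = np.eye(n_classes)[labels]          # distributed (one-hot) class encoding
    return X, Y, labels
```

Calling `make_set1()` gives 300 training patterns with 100 features; a second call with a different seed and `n_per_class=50` would give a matching test set.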

We experimented on a single neural network with all input features by trying learning 
rates and momentum terms in increments of 0.05 and hidden units in increments of 
5 until the performance began to decline. 

® Clearly, because of this, all 30 features have some relevance to all three classes; how- 
ever, the 10 features used to generate each class’s instances are clearly substantially 
more relevant than the other 20 features. 
Fig. 1. Dataset 1 Performances (horizontal axis: Number of Features)

Fig. 2. Dataset 1 Correlations (horizontal axis: Number of Features)



In dataset 1 there is an abundance of features that are irrelevant for the clas- 
sification task. This data set was chosen to represent large data mining problems 
where the algorithms may get swamped by irrelevant data. Dataset 2 has fewer 
irrelevant features and was chosen to illustrate the performance of input deci- 
mation as a function of irrelevant information present in the feature space. By 
reducing the amount of noise in the feature space, the problem is subtly mod- 
ified: selecting the relevant features is now easier, but the effect of removing 
the irrelevant features on the base classifiers’ performance is reduced. Finally, 
dataset 3 was chosen to have overlap among the features relevant to each class. 
This provides a more difficult problem where the base classifiers are now forced 
to select some common features, reducing the potential for correlation reduction. 



4.1 Synthetic Set 1 

Figures 1 and 2 present the classification accuracies and base classifier
correlations, respectively, as a function of the number of inputs (which is either
the number of selected principal components or the number of features selected
for each base classifier through input decimation). The original single classifier
and original ensemble use all the input features. The points for the maximum
number of features (e.g., 100 features in this dataset) always represent the
performance of the original classifier/ensemble.

An important observation that is apparent from these results is that neither 
PCA ensembles nor PCA base classifiers are particularly sensitive to the number 
of inputs. The correlations among the base classifiers reinforce this conclusion. 
Fewer input features in PCA means the base classifiers are more correlated since 
they all share the same principal features. Note however, that input decimated 
base classifiers have little correlation for small numbers of features, increasing 
correlation up to 30 features, and decreasing correlation after that. The base 
classifiers’ average performance follows a similar pattern. Interestingly though, 

® The base classifier used was an MLP with a single hidden layer consisting of 95 units, 
trained using a learning rate of 0.2 and a momentum term of 0.5. 
input decimated ensembles are not adversely affected by the poor performance of
the base classifiers (e.g., input decimated ensembles with 5 features outperformed 
input decimated ensembles with 50 features while base classifiers with 5 features 
gave significantly worse results than base classifiers with 50 features). 

In cases where more than 30 features were used, the performance of the
ensemble declined as further features were added, i.e., as more and more
irrelevant features were included. However, all the input decimation ensembles
provided statistically significant improvements over the original ensembles and 
PCA ensembles. 

The single decimated classifiers with 20 and more features outperformed 
the original single classifier. This perhaps surprising result (as one might have 
expected only the ensemble performance to improve when using subsets of the 
features) is mainly due to the simplification of the learning tasks, which allows 
the classifiers to learn the mapping more efficiently. 

Interestingly, the average correlation among classifiers does not decrease un- 
til a very small number of features remain. We attribute this to the removal 
of noise — removing noise increases the amount of information shared between 
the base classifiers. Indeed, the correlation increases steadily as features are re- 
moved until we reach 30 features (which corresponds to the actual number of 
relevant features). After that point, removing features reduces the correlation 
and the individual classifier performances. However, the ensemble performance 
still remains high. This experiment clearly shows a typical trade-off in ensemble 
learning: one can either increase individual classifier performance (as for input 
decimation with more than 30 features) or reduce the correlation among classi- 
fiers (as for input decimation with less than 20 features) to improve ensemble 
performance. 

4.2 Synthetic Set 2 

Figures 3 and 4 present the classification accuracies and base classifier
correlations, respectively, for the second data set, which is obtained by reducing the
number of irrelevant features (from 70 to 50) from the first dataset. The
decimated ensembles with 5 and 70 features marginally outperformed the original
ensemble and PCA-based ensemble, while the remaining ones performed signifi- 
cantly better. Note that, just as it was for the first data set, the input decimated 
single classifiers with 20 or more features outperformed the single original clas- 
sifier. This demonstrates that if the feature set is noisy (an assumption that 
almost always holds in the real world) improvements are achieved through di- 
mensionality reduction alone. 

4.3 Synthetic Set 3 

Figures 5 and 6 present the results for the third data set, which is similar to
the first dataset except that there is overlap among the relevant features for the 

^ The single classifier used was an MLP with a single hidden layer consisting of 65 
units, trained using a learning rate of 0.2 and a momentum term of 0.5. 
Fig. 3. Dataset 2 Performances

Fig. 4. Dataset 2 Correlations

Fig. 5. Dataset 3 Performances (horizontal axis: Number of Features)

Fig. 6. Dataset 3 Correlations



classes. Because of this overlap, this feature set has fewer total relevant features
and thus it constitutes a more difficult problem (as indicated by comparing the 
results on the full feature classifiers and ensembles on this dataset to the previous 
ones). 

Note that the correlations in this data set remained fairly constant across 
the board. Unlike the results shown in Figure 2, input decimation did not reduce
correlations dramatically for small feature sets. This is mainly caused by the 
“coupling” among the features (i.e., the presence of features that are essential 
to many classes due to the overlap). 

In spite of these difficulties, input decimation ensembles perform extremely 
well. Indeed, they significantly outperform both the original ensemble and PCA 
ensembles on all but a few subsets where they only provide marginal improve- 
ments. Furthermore the input-decimated single classifiers also outperform their 
original and PCA counterparts for all but the 60 and 70 feature subsets. This is 
particularly heartening since this feature set is a more representative abstraction 

® The single classifier used was an MLP with a single hidden layer consisting of 95 
units, trained using a learning rate of 0.2 and a momentum term of 0.5. 
of real data sets (data sets with “clean” separation among classes are quite rare).
This experiment demonstrates that when there is overlap among classes, class 
information becomes particularly relevant. PCA operates without this vital in- 
formation, therefore it cannot provide any statistically significant improvements 
over the original classifiers and ensembles. 

5 Discussion 

This paper discusses input decimation, a dimensionality reduction-based en- 
semble method that provides good generalization by reducing the correlations 
among the classifiers in the ensemble. Through controlled experiments, we show 
that the input decimated single classifiers outperform the single original clas- 
sifiers (trained on the full feature set), demonstrating that simply eliminating 
irrelevant features can improve performance. In addition, eliminating irrelevant
features in each of many classifiers using different relevance criteria (in this 
case, relevance with respect to different classes) yields significant improvement 
in ensemble performance, as seen by comparing our decimated ensembles to 
the original ensembles. Selecting the features using class label information also 
provides significant performance gains over PCA-based ensembles.

Through our tests on synthetic datasets, we examined the characteristics 
that datasets need to have to fully benefit from input decimation. We observed 
that input decimation performs best when (i) there are a large number of fea- 
tures (i.e., where it’s likely that there will be irrelevant features); and (ii) when 
the number of training examples is relatively small (i.e., where it’s difficult to 
properly learn all the parameters in a classifier based on the full feature set). 
In both cases, by removing the extraneous features, input decimation reduces 
noise and thereby reduces the number of training examples needed to produce 
a meaningful model (i.e., alleviating the curse of dimensionality). Our synthetic 
datasets were generated using multivariate distributions where the feature val- 
ues were generated independently. We plan to generate synthetic datasets with 
dependencies among the features to see how they affect our method. 

Note that input decimation shares the central aim of generating a diverse 
pool of classifiers for the ensemble with many methods, and most notably with 
bagging. However, by focusing on the input features rather than the input pat- 
terns, input decimation focuses on a different “axis” of correlation reduction 
than does bagging. Consequently, input decimation is orthogonal to bagging, 
and one can use input decimation in conjunction with bagging. 

A final observation is that input decimation works well in spite of our rather 
crude method of feature selection (i.e., using statistical correlation of each fea- 
ture individually with each class). One reason why this simple method succeeds 

® Although this result is perplexing from an information theory perspective, it is con- 
sistent with learning theory: by removing features we simplify the learning task and 
thus allow the base classifiers to reach their “peak” performance. 

Furthermore, IDEs also outperform random feature subset selection [2, 17] on real 
datasets [15]. 
is that we have greatly simplified the relevance criterion: unlike other feature
selection methods that consider the discriminatory ability across all classes, we 
only consider the relevance of the features to a single class. This typically causes 
each classifier in the ensemble to get a different subset of features, leading to 
the superior performance we have demonstrated. Nevertheless, we are currently 
extending this work in three directions: considering cross-correlations among the 
features; investigating mutual information-based relevance criteria; and incorpo- 
rating global relevance into the selection process. 

Acknowledgments. Part of this work was done while Nikunj Oza was visiting 
NASA Ames Research Center. 
Abstract. A mathematical analogy between the process of multiple expert
fusion and the tomographic reconstruction of Radon integral data is
outlined for the specific instance of the combination of classifiers contain- 
ing discrete data sets. Within this metaphor all conventional methods of 
classifier combination come, to a greater or lesser degree, to resemble 
the unfiltered back-projection of the constituent classifiers’ probability 
density functions: an implicit attempt to reconstruct the PDF of the
composite pattern space. In these probabilistic terms, the combination
of classifiers with identical feature-sets correspondingly constitutes an
attempt at morphological manipulation of the composite pattern-space
PDF. A consideration of the separate benefits of combination along these
dualistic lines eventually leads to an optimal strategy for classifier com- 
bination under arbitrary conditions. 



1 Introduction 

We present a metaphor for classifier combination in terms of the apparently 
unrelated process of the reconstruction of Radon integral data via tomographic 
means. By interpreting the combination of classifiers with distinct feature sets 
as the implicit reconstruction of the composite pattern space probability density 
function (PDF) of the entire space of features, we can begin to envisage the 
problem in geometric terms, and, ultimately (although beyond the scope of this 
paper) propose an optimal approach both to this, and later to the more general, 
problem of non-distinct feature sets [cf 5]. 

2 Context of Analysis 

We specify as follows our prior assumptions in relation to conventional combi- 
natorial schemes (generalising later to a less constricting set of assumptions): 

1. It is assumed, at least initially, that the selection of features is decided 
through classifier preference, and that this is accomplished via the straight- 
forward omission of superfluous dimensions as appropriate. 

2. For simplicity, it shall (at least at the outset) be assumed that the set of 
classifiers operate on only one feature individually, and that these are dis- 
tinct (though note that the former is not a prerequisite of the method). 



J. Kittler and F. Roli (Eds.): MCS 2001, LNCS 2096, pp. 248-^^^ 2001. 
@ Springer- Verlag Berlin Heidelberg 2001 



Classifier Combination as a Tomographic Process 249 



Evidence that the stronger of these two assumptions, the latter, is reason- 
ably representative of the usual situation comes from [1], wherein features 
selected within a combinatorial context are consistently shown to favour the 
allocation of distinct feature sets to the constituent classifiers. 

3. We shall consider that the construction of a classifier is the equivalent of
estimating the PDFs - $\forall i$, where $M(x, y)$ is
the final set of feature dimensions passed from the feature selection algorithm
for class $y$ (the cardinality of which, $k_i$, we will initially set to unity for every
class identified by the feature selector: i.e., $k_i = 1\ \forall i$).

4 . It is assumed that in any reasonable feature selection regime the total set 
of features employed by the various classifiers exhausts the classification 
information available in the pattern space (ie, the remaining dimensions 
contribute only a stochastic noise component to the individual clusters). 



Given assumption 3 above (that individual classifiers may be regarded as 
PDFs) and further, that pattern vectors corresponding to a particular class may 
be regarded as deriving from an n-dimensional probability distribution, then 
the process of feature selection may be envisaged as an integration over the di- 
mensions redundant to that particular classification scheme (the discarding of 
superfluous dimensions being, in effect, the linear projection of a higher dimen- 
sional space onto a lower one, ultimately a 1-dimensional space in the above 
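framework). That is, for $n$-dimensional pattern data of class $i$:

$$\int_{-\infty}^{+\infty} p(x_k|\omega_i)\,dx_k = \int_{-\infty}^{+\infty} \left[ \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} p(X|\omega_i)\,dx_1 \ldots dx_{k-1}\,dx_{k+1} \ldots dx_n \right] dx_k \qquad (1)$$

with $X = (x_1, x_2, \ldots, x_n)$.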

Because of condition 4 above (a good approximation when a range of classifiers
is assumed), we shall consider that the pattern vector effectively terminates
at index $j$, where $j < n$ is the total number of features (and also classifiers, given
condition 3). That is, $X = (x_1, x_2, \ldots, x_j)$ now represents the extent of the
pattern vector dimensionality. In the integral analogy, the remaining dimensions
that are integrated over in equation 1 serve to reduce the stochastic component
of the joint PDF by virtue of the increased bin count attributable to each of the
pattern vector indices.



3 The Radon Transformation 

Now, it is the basis of our thesis that we may regard equation 1 as the j-dimension 
analogue of the Radon transform (essentially the mathematical equivalent of 
the physical measurements taken within a tomographic imaging regime), an 
assertion that we shall make explicit in section 5 after having found a method 
for extending the inverse Radon transform to an arbitrarily large dimensionality. 
The conventional Radon transform, however, is defined in terms of the
two-dimensional function $f(x, y)$ as follows (after [2]):

$$R(s, \theta)[f(x, y)] = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} f(x, y)\,\delta(s - x\cos\theta - y\sin\theta)\,dx\,dy \quad (= g(s, \theta)) \qquad (2)$$

where $s$ may be regarded as a perpendicular distance to a line in $(x, y)$ space,
and $\theta$ the angle that that line subtends in relation to the $x$ axis. $R(s, \theta)$ is then
an integral over $f(x, y)$ along the line specified.

As a first approximation to inverting the Radon transform and reconstructing
the original data $f(x, y)$, we might apply the Hilbert space adjoint operator of
$R(s, \theta)$, the so-called back-projection operator:

$$R^*[R(s, \theta)](\mathbf{x}) = \int_S R(\theta, \boldsymbol{\theta} \cdot \mathbf{x})\,d\theta \qquad (3)$$

with $\mathbf{x} = (x, y)$, $\boldsymbol{\theta} = (\cos\theta, \sin\theta)$.

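To appreciate how this operates, consider first the following identity written
in terms of the arbitrary function $v$, where $V = R^* v$:

$$
\begin{aligned}
\int_S \int_s v(\theta, \mathbf{x} \cdot \boldsymbol{\theta} - s)\,g(\theta, s)\,ds\,d\theta
&= \int_S \int_s v(\theta, \mathbf{x} \cdot \boldsymbol{\theta} - s) \int_{\mathbb{R}^2} f(\mathbf{x}')\,\delta(s - \mathbf{x}' \cdot \boldsymbol{\theta})\,d^2x'\,ds\,d\theta \\
&= \int_S \int_{\mathbb{R}^2} v(\theta, \mathbf{x} \cdot \boldsymbol{\theta} - \mathbf{x}' \cdot \boldsymbol{\theta})\,f(\mathbf{x}')\,d^2x'\,d\theta \quad \text{(eliminating } s\text{)} \\
&= \int_{\mathbb{R}^2} \left[ \int_S v(\theta, (\mathbf{x} - \mathbf{x}') \cdot \boldsymbol{\theta})\,d\theta \right] f(\mathbf{x}')\,d^2x' \\
&= \int_{\mathbb{R}^2} V(\mathbf{x} - \mathbf{x}')\,f(\mathbf{x}')\,d^2x' \quad \text{(via the definition of } V\ [= R^* v]\text{)} \\
&= V * f \qquad (4)
\end{aligned}
$$

The first term in the above may be symbolically written $R^*(v * g)$, where it
is understood that the convolution is with respect to the length variable and not
the angular term in $g$. Hence, we have that $V * f = R^*(v * g)$.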

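We may describe the relationship between $V$ and $v$ in terms of their Fourier
transforms. Consider first the two-dimensional transform of $V$:

$$
\begin{aligned}
F(\mathbf{k})[V(\mathbf{x})] &= (2\pi)^{-1} \int_{\mathbb{R}^2} e^{-i\mathbf{k} \cdot \mathbf{x}}\,V(\mathbf{x})\,d^2x \\
&= (2\pi)^{-1} \int_{\mathbb{R}^2} e^{-i\mathbf{k} \cdot \mathbf{x}} \int_S v(\theta, \mathbf{x} \cdot \boldsymbol{\theta})\,d\theta\,d^2x \quad \text{(by substitution)} \\
&= (2\pi)^{-1} \int_S \int_{\mathbb{R}^2} e^{-i\mathbf{k} \cdot \mathbf{x}}\,v(\theta, \mathbf{x} \cdot \boldsymbol{\theta})\,d^2x\,d\theta \qquad (5)
\end{aligned}
$$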

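We now consider a slice through this transform along the direction $\boldsymbol{\theta}$. This
may be accomplished in the above by substituting in the delta function $\delta(\mathbf{k} - \sigma\boldsymbol{\theta})$
within the $\theta$ integral (i.e., coupling the variables $\mathbf{k}$ and $\boldsymbol{\theta}$) and transforming it to a
$k$-space integral via the corresponding transformation $dk \rightarrow \sigma\,d\theta$ ($\sigma$ is a positive
real number):

$$
\begin{aligned}
F(\sigma\boldsymbol{\theta})[V(\mathbf{x})]
&= (2\pi)^{-1} \int_S \int_{\mathbb{R}^2} e^{-i\mathbf{k} \cdot \mathbf{x}}\,v(\theta, \mathbf{x} \cdot \boldsymbol{\theta})\,d^2x\;\delta(\mathbf{k} - \sigma\boldsymbol{\theta})\,d\theta \\
&= (2\pi)^{-1} \int_S \int_{\mathbb{R}^2} e^{-i\mathbf{k} \cdot \mathbf{x}}\,v(\theta, \mathbf{x} \cdot \boldsymbol{\theta})\,d^2x\;\delta(\mathbf{k} - \sigma\boldsymbol{\theta})\,dk\,\sigma^{-1} \\
&= (2\pi)^{-1} \int_{\mathbb{R}^2} e^{-i\sigma\boldsymbol{\theta} \cdot \mathbf{x}}\,v(\theta, \mathbf{x} \cdot \boldsymbol{\theta})\,d^2x\,\sigma^{-1}
\end{aligned}
$$

We have also that $d(\mathbf{x} \cdot \boldsymbol{\theta}) = d\mathbf{x} \cdot \boldsymbol{\theta}$ for constant $\boldsymbol{\theta}$. Thus:

$$
\begin{aligned}
F(\sigma\boldsymbol{\theta})[V(\mathbf{x})]
&= (2\pi)^{-1} \int_{\mathbb{R}^2} e^{-i\sigma\,\mathbf{x} \cdot \boldsymbol{\theta}}\,v(\theta, \mathbf{x} \cdot \boldsymbol{\theta})\,d(\mathbf{x} \cdot \boldsymbol{\theta})\,(\sigma|\boldsymbol{\theta}|)^{-1} \\
&= (2\pi)^{-1} \int_{\mathbb{R}} e^{-i\sigma z}\,v(\theta, z)\,dz\,(\sigma|\boldsymbol{\theta}|)^{-1} \quad \text{(where } z = \mathbf{x} \cdot \boldsymbol{\theta}\text{)}
\end{aligned}
$$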

The $z$-dependent terms now form a Fourier transform with respect to the
second variable in $v$. Hence, we may write the above in the following form to
elucidate the precise relation between $V$ and $v$ in Fourier terms:

$$F(\sigma\boldsymbol{\theta})[V(\mathbf{x})] = (2\pi)^{-1}\,F_z(\sigma)[v(\theta, z)]\,(\sigma|\boldsymbol{\theta}|)^{-1} \qquad (6)$$

The effect of the back-projection operator on the Radon transform of $f$ may
then be appreciated, via a consideration of equation 4, by setting $v$ to be a
Dirac delta function in $s$ (corresponding to an identity operation within the
convolution). The $V$ corresponding to this $v$ may then be deduced by inserting
the Fourier transform of the delta function (unity throughout $f$-space) into the
above equation. Hence, we see that the effect of applying the back-projection
operator to the Radon transformed $f$ function is the equivalent of convolving $f$
with the inverse Fourier-transformed remainder:



$$f_{\text{recovered}}(x, y) = f_{\text{original}} * F^{-1}(\sigma^{-1}) \qquad (7)$$
In terms of the tomographic analogy, we retrieve a “blurred” version of the
original data. In fact, the object of tomography is exactly the reverse of this
process: we seek to obtain a $v$ function such that it is $V$ that approaches the
form of the delta function: that is, transforming the RHS of equation 4 into $f$
alone. In this instance, we may regard the $v$ function as a “filtering operator”
that serves to remove morphology attributable to the sampling geometry rather
than the original data, and which is applied to the Radon data at a stage
prior to inversion via the back-projection operator.
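The blurring can be seen in a small discrete sketch. The example below assumes only two axis-aligned projection angles on a 5 x 5 grid (an assumption chosen to mirror the orthogonal feature axes discussed later; the helper names are hypothetical) and applies unfiltered back-projection to a single point mass.

```python
import numpy as np

def project(f):
    """Axis-aligned Radon samples of a 2-D array: line sums along each axis."""
    return f.sum(axis=1), f.sum(axis=0)

def back_project(g_rows, g_cols):
    """Unfiltered back-projection: smear each projection back along its
    lines of integration and add the results."""
    return g_rows[:, None] + g_cols[None, :]

f = np.zeros((5, 5))
f[1, 3] = 1.0                        # a single point mass
recon = back_project(*project(f))    # cross-shaped smear through (1, 3)
```

The reconstruction peaks at the right location but also carries mass along the whole of row 1 and column 3, which is the geometry-induced morphology that a filter $v$ is meant to remove.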

We shall in section 5 set out to show that the summation method of clas- 
sifier combination (which is representative of many more generalised combina- 
tion approaches under certain conditions, such as very limited class information 
within the individual classifiers) is, in effect, the equivalent of applying the back- 
projection operator immediately to the classifier PDFs (which in our analogy are 
to be considered Radon transforms), without any attempt to apply prior filtering 
(ie, setting v to the delta function in equation 4). It is then via this observa- 
tion that we ultimately hope to improve on the combination process, achieving 
an optimal, or near optimal solution to the inversion problem by finding an 
appropriate filter, v, albeit in the context of probability theory. 

Prior to setting out this correspondence we shall first extend the method to 
the j-dimensions required of our pattern vector, and illustrate how the mechanics 
of the Radon reconstruction might be applied within the current context. 



4 N-Dimensional Generalisation of the Radon Transform

We can show that there exists [cf 5] a discretised $(n-1)$-to-$n$-dimensional
generalisation of both the inverse Radon transformation and deblurring mechanisms,
which, to take the three dimensional instance of the latter for the two-angular- 
sample spaces implicit in the probabilistic geometry of feature selection, has the 
form: 



'5a/3/)R(^a/30j 4” 

4 =—o ' ^ct/31 ^a0l' )R(^a/31)'^ct/3/d 4 ” 

Q:/3 ^ 

4 = — ' ^ 7/31 ' 37 / 3 /^ 0 -^(^ 7/31 i ^ 7 / 3/ '0 

7/3 ^ 

V a, j3 : a, /3 e X; a ^ j3; 0 < a, /3 < n, (8) 

(the subscript $\Omega$ is appended here to indicate a bandwidth limitation
attributable to the very low number of angular Radon samples implied by the
orthogonal nature of the feature space integrations: cf [5] for a full specification
of this term, and a description of its implication for the maximal information
content of the reconstructed space).

The Greek subscripts are then feature labels, and the numeric subscripts
are the angular sample indices within the (hyper)-plane specified by the feature
indices. For the detailed derivation of the value of the multiplication constant, $A$, and
the trivial normalising assumptions required in allocating the same $A$ to each
convolution, the reader is referred to [5].

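We can therefore demonstrate that the unfiltered volumetric inverse Radon
transformation (that is, the three-dimensional back-projection operator applied
to the two-dimensional “facet” integrals obtained from the feature selection
process) constitutes a linear summation of the form:

$$A\left[ R(\theta_{\alpha\beta 0}, \mathbf{x} \cdot \boldsymbol{\theta}_{\alpha\beta 0}) + R(\theta_{\alpha\beta 1}, \mathbf{x} \cdot \boldsymbol{\theta}_{\alpha\beta 1}) + R(\theta_{\gamma\beta 0}, \mathbf{x} \cdot \boldsymbol{\theta}_{\gamma\beta 0}) \right] \qquad (9)$$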

When generalised to an $(n-1)$-to-$n$ dimensional inverse Radon transformation,
this formula can then be applied recursively to generate the full
$N$-dimensional pattern-space PDF from component classifier feature-spaces of
ar-
bitrarily small dimensionality (equal to unity in our case, given the specifications 
at the outset), the recursion retaining this linearity of summation for both the 
inverse Radon transformation and geometry-filtering procedures. 

5 Correspondence with Classifier Combination Theory 

Having obtained a form (or rather, a method) for n-dimensional inverse Radon 
transformation, we are now in a position to make the correspondence with clas- 
sifier combination theory more formal. That is, we shall seek to encompass the 
various extant combinatorial decision theories within the tomographic frame- 
work that we have developed over the preceding sections, and show that they 
represent, within certain probabilistic bounds, an imperfect approximation to 
the unfiltered inverse Radon transformation. 

We will firstly, however, demonstrate how we might explicitly substitute 
probabilistic terms into equation 8, and therefore, by extension, the complete 
n-dimensional inverse Radon transformation. We have initially then to establish 
exactly what is meant in geometrical terms by the Radon forms upon which 
equation 8 is constructed. It is helpful in this endeavour to, at least initially, 
eliminate the complication of the pre-filtering convolution represented by v, and 
therefore we consider only equation 9. 

However, recall from equation 2 that:

$$R(\theta, s)[f(x_1, x_2)] = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} f(x'_1, x'_2)\,\delta(s - x'_1\cos\theta - x'_2\sin\theta)\,dx'_1\,dx'_2 \quad (= g(s, \theta)) \qquad (10)$$

Now, in explicitly making the feature geometry congruent with the Radon
geometry, we also have that:

$$\cos\theta_{x_2} = \sin\theta_{x_1} = 0 \quad \text{and} \quad \cos\theta_{x_1} = \sin\theta_{x_2} = 1 \qquad (11)$$

($\theta$ now being measured in relation to the $x_1$ axis).
Thus, for example, picking an ordinate at random:

$$
\begin{aligned}
R(\theta_{x_1}, x_1) &= \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} f(x'_1, x'_2)\,\delta(x_1 - x'_1)\,dx'_1\,dx'_2 \\
&= \int_{-\infty}^{+\infty} f(x_1, x'_2)\,dx'_2 \\
&= \int_{-\infty}^{+\infty} f(x_1, x_2)\,dx_2 \qquad (12)
\end{aligned}
$$

and similarly for $x_2$, $x_3$.

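Now, a rational extension of the nomenclature of equation 1 would allow us
to write:

$$\int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} p(x_1, x_2|\omega_i)\,dx_1\,dx_2 = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} \left[ \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} p(X|\omega_i)\,dx_3 \ldots dx_n \right] dx_1\,dx_2 \qquad (13)$$

(and similarly for the remaining pairs of basis vector combinations).
We, of course, still have that:

$$\int_{-\infty}^{+\infty} p(x_1|\omega_i)\,dx_1 = \int_{-\infty}^{+\infty} \left[ \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} p(X|\omega_i)\,dx_2 \ldots dx_n \right] dx_1 = \int_{-\infty}^{+\infty} \left[ \int_{-\infty}^{+\infty} p(x_1, x_2|\omega_i)\,dx_2 \right] dx_1 \qquad (14)$$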

Thus, by setting the equivalence $f(x_1, x_2) = p(x_1, x_2|\omega_i)$, we find by direct
substitution into equation 12 that we may state that:

$$\int_{-\infty}^{+\infty} f(x_1, x_2)\,dx_2 = p(x_1|\omega_i) \qquad (15)$$

and similarly for the remaining numeric subscripts.
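Equation 15 can be checked numerically on a small grid. The sketch below is an illustration under stated assumptions: the joint PDF is an arbitrary normalised random table, the Dirac delta of equation 2 is replaced by a Kronecker delta on grid coordinates, and `radon_sample` is a hypothetical helper name.

```python
import numpy as np

def radon_sample(f, x1, x2, theta, s, tol=1e-9):
    """Discrete analogue of equation 2: sum f over the grid points lying on
    the line x1*cos(theta) + x2*sin(theta) = s."""
    G1, G2 = np.meshgrid(x1, x2, indexing="ij")
    on_line = np.abs(G1 * np.cos(theta) + G2 * np.sin(theta) - s) < tol
    return f[on_line].sum()

x1 = np.arange(4.0)
x2 = np.arange(3.0)
rng = np.random.default_rng(0)
p_joint = rng.random((4, 3))
p_joint /= p_joint.sum()             # stands in for p(x1, x2 | class)

# theta = 0 selects lines of constant x1, so each Radon sample is the
# joint PDF summed over x2, i.e. a value of the marginal p(x1 | class).
radon_at_0 = np.array([radon_sample(p_joint, x1, x2, 0.0, s) for s in x1])
marginal_x1 = p_joint.sum(axis=1)
```

Here `radon_at_0` and `marginal_x1` agree to floating-point precision, and the same holds at `theta = np.pi / 2` for the marginal over `x2`.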

Hence, in consequence, we may simply restate the unfiltered two-to-three 
dimensional inverse Radon transformation in the more transparent form: 



$$A\left[ p(x_1|\omega_i) + p(x_2|\omega_i) + p(x_3|\omega_i) \right] \qquad (16)$$

Moreover, we can again go further and extend this approach to the recursive
methodology of the $n$-dimensional inverse Radon transformation. Declining explicit
calculation of the various normalising constants corresponding to $A$ in the above
(this being a relatively complex undertaking, and not in any case required in the
context of the decision making schemes within which the method will ultimately be
applied [see later]), we find, in the most general terms, that the unfiltered
$n$-dimensional inverse Radon transformation will have the form:

$$\sum_{\text{all } k} p(x_k|\omega_i), \qquad (17)$$

which clearly comes to resemble the Sum Rule decision making scheme (a
correspondence we shall make formal later).

The substitution of probabilistic terms into the generalised inverse Radon
transformation having thus been rendered explicit, it is now an elementary matter
to substitute the previously omitted filtering function $v_\Omega$ back into equation
17 (the various subscript redundancies induced by an appropriate selection of
the coordinate system above applying equally to the variable $s$ in equation 8),
most particularly since the set of filtering convolutions will remain additive in
relation to their corresponding $p(x_k|\omega_i)$ functions throughout the recursive
increment in dimensionality, and will therefore readily generalise to a composite
$n$-dimensional filtering function. (We omit a discussion of its specific form since
this is entirely dependent on the choice of $v_\Omega$.)

Having transcribed the inverse Radon transform into purely probabilistic 
terms and eliminated any residual geometric aspects of the problem, we may now 
turn to an investigation of how the n-dimensional reconstruction relates to the 
decision making process implicit within every regime of classifier combination. 

As a preliminary to this endeavour, we must firstly ensure that there ex- 
ist comparable pattern vectors for each class PDF (such not necessarily being 
the case for feature sets constructed on a class-by-class basis, as within our 
approach). That is, we shall need to ensure that: 

$$= p(x_{l_k}, \ldots, x_{u_k}|\omega_k) \quad \forall\, i, k \qquad (18)$$

where $u_k$ and $l_k$ are, respectively, the highest and lowest feature indices of 
the various feature sets involved in the combination, and ^ is the cardinality of 
the feature set corresponding to the kth class and ith classifier: $R_i(n_{k,i})$ is then 
the nth highest feature index in the feature set presented to the ith classifier for 
computation of class PDF number k. 

This may be straightforwardly accomplished by the inclusion of null vector 
components, such that: 



= 1 (19) 

implicitly setting $l_k$ to 1 and $u_k$ to $N$, thereby allowing a universal approach 
for each class index $k$. 

Now, via the Bayes decision rule, ie that we: 

assign $X \rightarrow \omega_j$ if 

$$p(\omega_j|x_1, \ldots, x_N) = \max_k\, p(\omega_k|x_1, \ldots, x_N) , \qquad (20)$$

given: 

$$p(\omega_k|x_1, \ldots, x_N) = \frac{p(x_1, \ldots, x_N|\omega_k)\, p(\omega_k)}{p(x_1, \ldots, x_N)} , \qquad (21)$$

we have that our decision rule for unfiltered N-dimensional inverse Radon PDF 
reconstruction is: 

assign $X \rightarrow \omega_j$ if 

$$p(\omega_j|x_1, \ldots, x_N) = \max_k \left[ \frac{\sum_{i=1}^{N} p(x_i|\omega_k)\, p(\omega_k)}{p(x_1, \ldots, x_N)} \right] \qquad (22)$$

(from equation 17) 

The more familiar decision rules, however, may be derived solely via probabilistic 
constraints on the Bayes decision rule. For instance, suppose that we impose the 
condition that $x_1, \ldots, x_N$ are independent random variables, such that: 

$$p(x_1, \ldots, x_N|\omega_k) = \prod_{i=1}^{N} p(x_i|\omega_k) ; \qquad (23)$$

then we obtain the decision rule: 

assign $X \rightarrow \omega_j$ if 

$$p(\omega_j|x_1, \ldots, x_N) = \max_k \left[ \frac{\prod_{i=1}^{N} p(x_i|\omega_k)\, p(\omega_k)}{p(x_1, \ldots, x_N)} \right] \qquad (24)$$

That is, we obtain the classical “Product Rule”. 
If we impose the further constraint that: 

$$p(\omega_k|x_i) = p(\omega_k)[1 + \delta f(\omega_k, x_i)] , \qquad (25)$$

with $\delta f(\omega_k, x_i)$ an infinitesimal function (in effect, imposing a high degree 
of “overlap” amongst the total set of class PDFs, or, equivalently, a 
ubiquitous class membership ambiguity), and apply this directly to the Bayes 
theorem for single vectors, then we can demonstrate [5] that we obtain the 
classical “Sum Rule” decision scheme: 

assign $X \rightarrow \omega_j$ if 

$$p(\omega_j|x_1, \ldots, x_N) = \max_k \left[ \frac{\sum_{i=1}^{N} p(x_i|\omega_k)\, p(\omega_k)}{p(x_1, \ldots, x_N)} \right] \qquad (26)$$



This, however, is identical to our original decision rule for the unfiltered 
inverse Radon transformation. Hence, we may state that the unfiltered inverse 
Radon PDF reconstruction is, within a Bayesian decision-making context, the 
equivalent of the Sum Rule decision making scheme under the specified 
probabilistic constraints, and will thus produce near-optimal results only when the two 
conditions are satisfied (ie that the pattern vector components are statistically 
independent, and that there exists a high class membership ambiguity owing 
to similar PDF morphologies). The unfiltered inverse Radon decision making 
scheme then recreates the Product Rule under the less constrictive (and 
therefore more common) condition of a high class membership ambiguity alone, a 
condition, however, which must still presuppose very major constraints on the 
N-dimensional PDF morphology if the equality is to hold. 
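
The practical difference between the two schemes can be seen in a small numerical 
sketch. It assumes that each rule scores a class by its prior multiplied by either 
the product or the sum of the marginal class-conditional densities, the common 
denominator being irrelevant to the maximisation; the function names and the 
density values are purely illustrative. 

```python
from math import prod

def product_rule(likelihoods, priors):
    """Index of the class maximising p(w_k) * prod_i p(x_i | w_k)."""
    scores = [p * prod(row) for p, row in zip(priors, likelihoods)]
    return max(range(len(scores)), key=scores.__getitem__)

def sum_rule(likelihoods, priors):
    """Index of the class maximising p(w_k) * sum_i p(x_i | w_k)."""
    scores = [p * sum(row) for p, row in zip(priors, likelihoods)]
    return max(range(len(scores)), key=scores.__getitem__)

# likelihoods[k][i] = p(x_i | w_k): two classes, two classifiers (made-up values).
likelihoods = [[0.9, 0.01],   # class 0: one classifier confident, one close to zero
               [0.4, 0.4]]    # class 1: both classifiers moderately supportive
priors = [0.5, 0.5]

product_choice = product_rule(likelihoods, priors)  # a single near-zero term vetoes class 0
sum_choice = sum_rule(likelihoods, priors)          # the near-zero term is merely outvoted
```

With these values the two rules disagree: the product is dominated by the single 
near-zero density, whereas the sum is not, which is one way of seeing the greater 
tolerance of the additive form to individual estimation errors. 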

Very many other classical combination rules are derived from combinations 
of these preconditions (see [3]) and thus come to resemble, to some degree, the 
unfiltered inverse transform. Without exception, however, they will all impose 
very considerable constraints on the implied N-dimensional PDF reconstruction. 
When viewed in this morphological regard, it is clear that the lack of univer- 
sal application of classical methods of combination, however effective they may 
be within their typical domains of application, is (by an inversion of the above 
process) attributable to these implicit constrictions on the reconstructive pro- 
cess, to which these methods have been shown to offer an approximation. The 
only way in which we can free ourselves of these restrictions (on the assumption 
that we have obtained error-free PDFs [see later]) is then to apply the filtered 
inverse Radon transform in its entirety, since this inherently neither assumes nor 
imposes any morphological (and therefore probabilistic) constraints on the final 
N-dimensional PDF, other than those already implicit in the original PDF data. 

On information-theoretic grounds this would therefore represent an optimal 
solution to the implied problem of N-dimensional PDF reconstruction, it being 
apparent, by an inversion of the arguments above, that at least one aspect of 
every method of classifier combination is in some (not necessarily immediately 
obvious) way, the implicit recovery of an N-dimensional PDF. 

To be fully confident of this conclusion, we would have to consider whether 
the above argument is modified by the fact that the various classifier PDFs, in 
consequence of having been derived from a finite set of stochastically distributed 
pattern data points, would invariably, to some extent, deviate from the “true” 
(if only hypothetically existent) probability density functions. In fact a detailed 
analysis (see [5]) shows the method to exhibit a robustness to estimation error 
similar to that of the Sum Rule, which it has come to so closely resemble, the 
Sum Rule being the previously optimal combination scheme in this regard: see 
Kittler et al 1997. [4] 

6 Prospect and Summary 

We have thus far considered tomographic reconstruction theory only in terms 
of distinct feature sets: the contrary situation must be addressed if we are to 
arrive at a universal perspective of classifier combination. Before embarking on 
an investigation of this we should, however, reiterate just how exceptional it is to 
find overlapping feature sets amongst the classifiers within a combination when 
feature selection is explicitly carried out within a combinatorial context (see [1]). 

There would appear, then, to be a double aspect to the functionality 
of conventional classifier combination, one facet of which may be considered, 
to the extent that the feature spaces are overlapping, the refinement of PDF 
morphology to improve classification performance (such as via weighted averag- 
ing), and therefore a form of classification in its own right, and the other being 
that of tomographic reconstruction, in so far as the feature sets belonging to 
the classifiers within the combination are distinct. Classical techniques of com- 
bination have tended to conflate these two disparate aspects through not having 
made a rigorous distinction between those classifier combinations that, in effect, 
act as a single classifier and those combinations that may be considered to act on 
entirely distinct orthogonal projections of a single PDF encompassing the whole 
of the N-dimensional pattern space. We, in contrast, would find it necessary, in 
seeking an optimal solution to the combination problem, to make this distinction 
completely formal. Explicitly separating the two, however, involves reverting to 
a stage prior to combination, and addressing the nature of the feature selection 
process itself. Thus we find we must take a unified perspective on the appar- 
ently separate issues of feature selection and classifier combination if we are to 
fully exploit the potential of the tomographic metaphor in attaining an optimal 
solution to the problem. Full details of just such an approach are set out in [5]. 

It is fully intended that, besides the publication of this completely inclu- 
sive methodology, the findings in relation to an experimental implementation of 
the tomographic combination technique will form the basis of a future series of 
papers. 

Abstract. We propose a system for a regular updating of land-cover maps 
based on the use of temporal series of remote sensing images. Such a system is 
composed of an ensemble of partially unsupervised classifiers integrated in a 
multiple classifier architecture. The updating problem is formulated under the 
complex constraint that for some images of the considered multitemporal series 
no ground-truth information is available. With respect to the authors’ previous 
works on this topic [1-3], the novel contribution of this paper consists in: i) 
developing partially unsupervised classification algorithms defined in the 
framework of a cascade-classifier approach; ii) defining a specific strategy for 
the generation of an ensemble of classifiers, which exploits the peculiarities of 
the cascade-classifier approach. These novel aspects result in the definition of 
more robust and accurate classification systems. 



1. Introduction 

One of the major problems in geographical information systems (GIS) consists in 
defining strategies and procedures for regularly updating the land-cover maps stored in 
system databases. This crucial task can be carried out by using remote-sensing images 
regularly acquired over the specific area considered by space-borne sensors. However, 
although the production of a land-cover map for a given area can be easily performed 
by using standard supervised classification algorithms [4], the temporal updating of 
these maps is a more complex and challenging problem. The most critical problem 
concerns the availability of ground-truth information. In many cases, it is not possible 
to rely on training data for all the images necessary to ensure an updating of land- 
cover maps as frequent as required by applications. This prevents all the remote- 
sensed images acquired on the investigated area from being analysed by supervised 
classification techniques. 

In previous work [1-3], the aforementioned topic has been addressed by considering 
different aspects of the problem. Firstly, partially unsupervised classification 
methodologies able to update parameters of an already trained classifier, on the basis 
of the distribution of a new image, have been proposed [2-3]. These methodologies 
formulate the problem of the unsupervised retraining of classifiers in the framework 
of a mixture estimation problem solved with the expectation-maximisation (EM) 
algorithm. Secondly, in order to increase the robustness and accuracy of the resulting 
classification system, the use of a multiple classifier architecture has been proposed 
[1]. 

In this paper we take a further step in improving the features of the proposed system. In 
particular, we integrate the partially unsupervised classification problem of each 
classification technique in the context of a cascade-classifier approach. This allows 
one to exploit the temporal correlation between images to increase the effectiveness 
of the partially unsupervised classification process. Consequently, the resulting 
classifiers improve their global performances, without significantly increasing the 
classification time. Another issue addressed in this paper concerns the definition of 
the multiple-classifier architecture in the presence of cascade-classifier approaches. In 
fact, using cascade classifiers results in the possibility of defining a new simple 
strategy for making up an ensemble of classification algorithms. 

The paper is organised in six sections. Section 2 reports the formulation of the 
problem and describes the general architecture of the system. Section 3 presents the 
partially unsupervised classification problem in the framework of a cascade-classifier 
approach for both the ML and RBF neural network classification algorithms. Section 
4 addresses the problem of defining suitable ensembles of cascade classifiers. 
Experimental results are reported in Section 5. Conclusions are drawn 
in Section 6. 



2. Formulation of the Problem and Description of the General 
Architecture of the System 

2.1 Formulation of the Problem and Simplifying Assumptions 

Let $\mathbf{X}_1 = \{x^1_1, x^1_2, \ldots, x^1_{I \times J}\}$ and $\mathbf{X}_2 = \{x^2_1, x^2_2, \ldots, x^2_{I \times J}\}$ denote two 
multispectral images of dimensions $I \times J$ acquired in the area under analysis at 
the times $t_1$ and $t_2$, respectively. Let $x^1_j$ and $x^2_j$ be the feature vectors 
associated with the j-th pixel of the images, and $\Omega = \{\omega_1, \omega_2, \ldots, \omega_C\}$ be the set 
of C land-cover classes that characterise the geographical area considered at both 
$t_1$ and $t_2$. Let $l_j$ be the classification label of the j-th pixel at the time $t_2$. 
Finally, let $X_1$ and $X_2$ be two multivariate random variables representing the pixel 
values (i.e., the feature vector values) in $\mathbf{X}_1$ and $\mathbf{X}_2$, respectively. 

In the formulation of the proposed approach, we make the following assumptions: i) 
the same set $\Omega$ of C land-cover classes characterises the area considered over time 
(only the spatial distributions of such classes are supposed to vary); ii) a reliable 
training set $Y_1$ for the image $\mathbf{X}_1$ acquired at $t_1$ is available; iii) a training set $Y_2$ 
for the image $\mathbf{X}_2$ acquired at $t_2$ is not available. 

Under the aforementioned assumptions, the proposed system aims at carrying out a 
robust and accurate classification of $\mathbf{X}_2$ by exploiting the image $\mathbf{X}_1$, the training set 
$Y_1$, the image $\mathbf{X}_2$, and the temporal correlation between classes at $t_1$ and $t_2$ 
(readers can refer to [1-3] for a discussion of the assumptions considered). 



2.2 System Architecture 

The proposed system is based on a multiple classifier architecture composed of N 
different classification algorithms. The choice of this kind of architecture is due to the 
complexity of the problem addressed. In particular, the intrinsic complexity of the 
partially unsupervised classification problem results in classifiers that are less reliable 
and accurate than the corresponding supervised ones, especially for complex data sets. 
Therefore, by taking into account that generally ensembles of classifiers are more 
accurate and robust than the individual classifiers that make them up [5], we expect 
that a multiple-classifier approach increases the reliability and accuracy of the global 
classification system. A further step in the direction of improving the performances 
of the system consists in the choice of implementing each classification algorithm of 
the ensemble in the framework of a cascade-classifier approach. This point will be 
described and discussed in Section 3. 

The classification results provided by the members of the considered pool of 
cascade classifiers are combined by using classical unsupervised multiple-classifier 
strategies [6-7]. In particular, in this paper we consider two widely used combination 
procedures: the Majority Voting and the Bayesian Combination. 
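
The two combination procedures can be sketched as follows. This is a minimal 
illustration only: it assumes that the Bayesian Combination averages the posterior 
estimates supplied by the individual classifiers and selects the class with the 
largest average, and the function names and the tie-breaking choice are 
hypothetical. 

```python
from collections import Counter

def majority_vote(labels):
    """labels: class labels output by the N classifiers for one pixel."""
    counts = Counter(labels)
    top = max(counts.values())
    # Ties go to the label that appears first in `labels` (an arbitrary choice).
    return next(label for label in labels if counts[label] == top)

def bayesian_average(posteriors):
    """posteriors[i][k]: classifier i's posterior estimate for class k."""
    n_classes = len(posteriors[0])
    avg = [sum(p[k] for p in posteriors) / len(posteriors) for k in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)

vote = majority_vote(["forest", "water", "forest", "pasture"])
best_class = bayesian_average([[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]])
```

Neither procedure needs labelled data at combination time, which is why both 
remain usable when no ground truth is available for the image to be classified. 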



3. Partially Unsupervised Classification Techniques: A Cascade- 
Classifier Approach 



The standard supervised cascade-classifier approach (proposed by Swain in 1978 [8]) 
exploits the correlation between multitemporal images in order to increase the 
classification accuracy in cases in which training data are available for all the images 
considered. In our approach, we extend the application of the standard supervised 
cascade-classifier approach to partially unsupervised classification problems. 

The cascade-classifier decision strategy associates a generic pixel $x^2_j$ of the image 
$\mathbf{X}_2$ with a land-cover class according to the following decision rule [8]: 

$l_j = \omega_m \in \Omega$ if and only if 

$$P(\omega_m|x^1_j, x^2_j) = \max_{\omega_k \in \Omega} \{ P(\omega_k|x^1_j, x^2_j) \} \qquad (1)$$

where $P(\omega_k|x^1_j, x^2_j)$ is the value of the probability that the j-th pixel of the images 
belongs to the class $\omega_k$ at $t_2$, given the observations $x^1_j$ and $x^2_j$. Under the 
conventional assumption of class-conditional independence [8-9], the above decision 
rule can be rewritten as: 

$l_j = \omega_m \in \Omega$ if and only if 

$$\sum_{n=1}^{C} p(x^1_j|\omega_n)\, p(x^2_j|\omega_m)\, P(\omega_n, \omega_m) = \max_{\omega_k \in \Omega} \left\{ \sum_{n=1}^{C} p(x^1_j|\omega_n)\, p(x^2_j|\omega_k)\, P(\omega_n, \omega_k) \right\} \qquad (2)$$

where $p(x^i_j|\omega_k)$ is the value of the conditional density function for the pixel $x^i_j$, 
given the class $\omega_k \in \Omega$, and $P(\omega_n, \omega_k)$ is the prior joint probability of the pair of 
classes $(\omega_n, \omega_k)$. The latter term takes into account the temporal correlation between 
the two images. 
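
As a minimal sketch of the rule obtained under class-conditional independence, the 
function below scores each candidate class at the second date by summing, over the 
classes at the first date, the product of the two class-conditional densities and 
the joint prior. The function name, argument layout and numerical values are 
illustrative assumptions. 

```python
def cascade_decide(p1, p2, joint):
    """Cascade decision for one pixel.

    p1[n]       : density of the t1 observation given class n
    p2[m]       : density of the t2 observation given class m
    joint[n][m] : prior joint probability of class n at t1 and class m at t2
    Returns the index m of the class assigned at t2.
    """
    n_classes = len(p2)
    scores = [sum(p1[n] * p2[m] * joint[n][m] for n in range(len(p1)))
              for m in range(n_classes)]
    return max(range(n_classes), key=scores.__getitem__)

# The t2 observation alone is ambiguous (equal densities), but the t1
# observation favours class 0 and the joint prior says classes rarely change.
p1 = [0.8, 0.1]
p2 = [0.5, 0.5]
joint = [[0.45, 0.05],
         [0.05, 0.45]]
label = cascade_decide(p1, p2, joint)
```

Here the temporal correlation encoded in the joint prior resolves an ambiguity that 
the image at the second date could not resolve by itself. 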

We propose to integrate the partially unsupervised classification problem of the 
image $\mathbf{X}_2$ in the context of the above-described classification rule. Since the 
training set $Y_2$ is not available, the density functions of classes at the time $t_1$ 
(i.e. $p(X_1|\omega_n)$, $\omega_n \in \Omega$) are the only statistical terms of (2) that we can estimate in a 
supervised way. This means that, in order to accomplish the classification task, we 
must estimate both the density functions of classes at $t_2$ ($p(X_2|\omega_m)$, $\omega_m \in \Omega$) and the 
joint class probabilities ($P(\omega_n, \omega_m)$, $\omega_n, \omega_m \in \Omega$) in an unsupervised way. It is worth 
noting that usually the estimation of $p(X_i|\omega_n)$ ($\omega_n \in \Omega$, $i = 1, 2$) involves the 
computation of a parameter vector $\theta$. The number and nature of the vector 
components depend on the specific classifier used. 

To carry out the unsupervised estimation process, we propose to adopt an estimation 
procedure based on the observation that, under the assumption of class-conditional 
independence over time, the joint density function of the images $\mathbf{X}_1$ and $\mathbf{X}_2$, 
$p(X_1, X_2)$, can be described as a mixture density with $C \times C$ components (as many 
components as the possible pairs of classes): 

$$p(X_1, X_2) = \sum_{n=1}^{C} \sum_{m=1}^{C} p(X_1|\omega_n)\, p(X_2|\omega_m)\, P(\omega_n, \omega_m) . \qquad (3)$$



In this context, the estimation of the above terms becomes a mixture density 
estimation problem, which can be solved by applying the EM algorithm [10-12]. 

The specific procedure to be adopted for accomplishing the estimation process 
depends on the technique considered for carrying out the cascade classification, and in 
particular, on the vector of parameters $\theta$ required by the classifier. The possibility of 
establishing a relationship between the classifier parameters and the statistical terms 
involved in (2) is a basic constraint that each classification technique should satisfy in 
order to permit the use of (2). According to this requirement, we choose two suitable 
classification methods. The former is a parametric approach, based on the maximum- 
likelihood (ML) classifier [4]; the latter consists of a non-parametric technique based 
on radial basis function (RBF) neural networks [13]. The specific procedures for the 
partially unsupervised estimation of the parameters of ML and RBF classifiers are 
described in the following two sub-sections. 

3.1 Maximum-Likelihood Cascade Classifier 

Let us consider the problem of partially unsupervised cascade classification in the 
framework of ML classifiers. For simplicity, let us assume that the probability density 
function of each class can be described by a Gaussian distribution (i.e. by a mean 
vector $\mu$ and a covariance matrix $\Sigma$). Under this common assumption (widely 
adopted for multispectral image classification problems), the parameter vector $\theta$ of the 
classifier consists of the following components: 

$$\theta = [\mu^2_1, \Sigma^2_1, P(\omega_1, \omega_1), \ldots, \mu^2_C, \Sigma^2_C, P(\omega_C, \omega_C)] . \qquad (4)$$

By applying the EM algorithm we can derive the following iterative equations to 
estimate the parameters necessary to accomplish the cascade-classification process 

[3]: 




where the superscripts t and t+1 refer to the values of the parameters at the current 
and next iterations, respectively. The estimates of the parameters obtained at 
convergence and those achieved by the classical supervised procedure are then 
substituted into (2) in order to derive the required classification map. 

Concerning the initialization of the considered statistical terms, we refer the reader 
to [3]. 
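
As an illustration of how such an EM iteration can be organised, the sketch below 
applies textbook EM updates to a mixture whose components are pairs of classes, 
keeping the class densities at the first date fixed. It is written under 
simplifying assumptions (scalar features, hypothetical function names, a variance 
floor added for numerical safety) and is not claimed to coincide with the update 
equations of [3]. 

```python
import math

def gauss(x, mu, var):
    """Univariate Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_step(x2, p1, mu, var, joint, var_floor=1e-6):
    """One EM iteration.

    x2[j]       : scalar feature of pixel j at t2
    p1[j][n]    : density of pixel j at t1 given class n (supervised, held fixed)
    mu, var     : current Gaussian parameters of the C classes at t2
    joint[n][m] : current estimate of the joint class probabilities
    """
    C, J = len(mu), len(x2)
    new_joint = [[0.0] * C for _ in range(C)]
    post = []  # post[j][m]: posterior of class m at t2 for pixel j
    for j in range(J):
        # E-step: responsibility of every class pair (n, m) for pixel j.
        r = [[p1[j][n] * gauss(x2[j], mu[m], var[m]) * joint[n][m]
              for m in range(C)] for n in range(C)]
        total = sum(map(sum, r))
        post.append([sum(r[n][m] for n in range(C)) / total for m in range(C)])
        for n in range(C):
            for m in range(C):
                new_joint[n][m] += r[n][m] / (total * J)
    # M-step: responsibility-weighted means and variances of the t2 classes.
    weight = [sum(post[j][m] for j in range(J)) for m in range(C)]
    new_mu = [sum(post[j][m] * x2[j] for j in range(J)) / weight[m] for m in range(C)]
    new_var = [max(var_floor,
                   sum(post[j][m] * (x2[j] - new_mu[m]) ** 2 for j in range(J)) / weight[m])
               for m in range(C)]
    return new_mu, new_var, new_joint

# Toy data: three pixels near 0 and three near 5 (made-up values).
x2 = [0.0, 0.2, -0.1, 5.0, 5.2, 4.9]
p1 = [[0.9, 0.1]] * 3 + [[0.1, 0.9]] * 3
mu, var = [1.0, 4.0], [1.0, 1.0]
joint = [[0.25, 0.25], [0.25, 0.25]]
for _ in range(20):
    mu, var, joint = em_step(x2, p1, mu, var, joint)
```

In this toy run the means move to the centres of the two pixel groups and the 
joint probabilities concentrate on the class pairs that persist between the two 
dates. 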

3.2 RBF Neural Network Cascade Classifier 

The problem of partially unsupervised cascade classification with RBF neural 
networks is significantly more complex than the one associated with the ML 
parametric classifier. The increased complexity mainly depends on the 
non-parametric nature of RBF neural networks. In this context, the joint density function 
of the images $\mathbf{X}_1$ and $\mathbf{X}_2$, $p(X_1, X_2)$, is described by a mixture composed of K and Q 
Gaussian kernels at $t_1$ and $t_2$, respectively (both K and Q are greater than C). 
Consequently, equation (3) can be rewritten as: 

$$p(X_1, X_2) = \sum_{n=1}^{C} \sum_{m=1}^{C} \sum_{k=1}^{K} \sum_{q=1}^{Q} p(X_1|\varphi_k)\, p(X_2|\varphi_q)\, P(\varphi_k, \varphi_q)\, P(\omega_n, \omega_m|\varphi_k, \varphi_q) . \qquad (8)$$

Each Gaussian function is described by its mean vector $\mu$ and by a width 
parameter $\sigma$. Consequently, the parameter vector $\theta$ of the classifier is composed of 
the following terms: 

$$\theta = [\mu_1, \sigma_1, \ldots, P(\varphi_k, \varphi_q), \ldots, P(\omega_n, \omega_m|\varphi_k, \varphi_q), \ldots] . \qquad (9)$$

By applying the EM algorithm we can derive the following iterative equations to 
estimate the required parameters: 




$$P^{t+1}(\varphi_k, \varphi_q) = \frac{1}{I \times J} \sum_{j=1}^{I \times J} P^{t}(\varphi_k, \varphi_q|x^1_j, x^2_j) \qquad (12)$$



where d is the dimensionality of the input space, and the superscripts t and t+1 refer 
to the values of the parameters at the current and next iterations, respectively. 
Although the parameters $\mu$, $\sigma$ and $P(\varphi_k, \varphi_q)$ of the vector $\theta$ can be estimated in a 
fully unsupervised way, the estimation of the joint conditional probabilities 
$P(\omega_n, \omega_m|\varphi_k, \varphi_q)$ requires additional information. In this context, we propose to 
exploit some of the information obtained at convergence from the ML cascade 
classifier. In particular, a set $\hat{Y}_2$ of pixels, composed of the patterns that are most 
likely correctly categorized by the ML cascade classifier, is used for the initialization 
of the $P(\omega_n, \omega_m|\varphi_k, \varphi_q)$ conditional probabilities. Let $Y^n_1$ be the pixels of the 
training set $Y_1$ that belong to land-cover class $\omega_n$. Similarly, let $\hat{Y}^m_2$ be the sub-set of 
pixels of $\hat{Y}_2$ categorized by the ML cascade classifier as belonging to class $\omega_m$. The 
iterative equation to be used for estimating the joint probabilities $P(\omega_n, \omega_m|\varphi_k, \varphi_q)$ 
is the following: 
ZZ. 

:,'eY, , x]<eY^^ 

ZZ . P'((Pk^(Pjx),x^j) 

x,'eY, x]<eY^ 



Z Z Z (®/ . O^mlVk - (Pq )p' {(Pk - (P^ ! Xj , Xj ) 

x'eY,^ xje% 





_ZZ . P‘(<Pk^<Pjx],x]) 

A:|eY, x^jS:Y2 


ZZ. 

x‘gYj xjeY2 ^ 


TP‘{(o,,co./q},,<p^)p‘(<p,,g?^ Ix],x]) 

i=l 



ZZ P'i<Plc’<P, /Xj,Xj) 

x'eY, x^£% 



(13) 



As for the previous cascade classifier based on the ML technique, the estimates of 
the parameters obtained at convergence and those achieved by the classical supervised 
procedure are then used to accomplish the cascade classification. 



4. A Strategy for Generating Ensembles of Cascade-Classifier 
Algorithms 

The selection of the pool of classifiers to be integrated into the multiple-classifier 
architecture is an important and critical task. In the literature, several different 
strategies for defining the classifier ensemble have been proposed [6-7], [14-15]. 
From a theoretical viewpoint, a necessary and sufficient condition for an ensemble of 
classifiers to be more accurate than any of its individual members is that classifiers 
are accurate and diverse [16]. In our case, we can control only the second condition, 
since no training set is available to verify the first one. However, it is reasonable to 
assume that the majority of unsupervised cascade classifiers of the ensemble are 
sufficiently accurate. The main issue that remains to be solved for the definition of the 
ensemble concerns the capability of different classifiers to make uncorrelated 
errors. In our system, the choice of both a parametric (ML) and a non-parametric (RBF) 
classifier guarantees the use of two classification algorithms based on significantly 
different principles. For this reason, we expect these classifiers to make largely 
uncorrelated errors. However, two classification algorithms are not sufficient to 
define an effective multiple classifier architecture. This issue is more critical in our 
specific problem, where we cannot test the accuracy of each member of the ensemble. 
Therefore, the probability that the partially unsupervised estimation procedures result 
in classifiers affected by a significant error rate is higher than in the standard 
supervised case. Consequently, to increase the reliability of the system, we need to 
generate a pool of N classifiers, with N>2. According to the literature, we can define 
different RBF architectures in order to define different classification algorithms for 
the ensemble [17]. However, since we are dealing with cascade-classifier approaches, 
we propose to use an alternative, deterministic, and simple strategy for defining the 
ensemble. This strategy is based on our classifier peculiarities. In particular, in the 
case of cascade classifiers, one set of key parameters estimated in the partially 
unsupervised process is composed of the joint class probabilities $P(\omega_n, \omega_m)$, which 
are associated with the temporal correlation between classes. The different classifiers 
(i.e. ML and RBF) perform different estimations of the aforementioned probabilities, 
on the basis of the different classification and estimation principles. For this reason, 
we propose to introduce in the ensemble classifiers obtained by exchanging the 
estimates of the prior joint probabilities of classes performed by different algorithms. 
In this way, we merge the parameters estimated with different procedures in order to 
obtain different classifier configurations. In our case, given an ML and an RBF 
cascade classifier, this strategy results in an ensemble composed of the two 
“original” classifiers and two additional ML and RBF algorithms obtained by 
exchanging the prior joint probabilities estimated in a partially unsupervised way by 
the original classifiers. This involves a multiple classifier architecture composed of 
four classifiers. It is possible to further increase the number of classifiers by extending 
the aforementioned procedure to a case with more RBF neural network architectures. 
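
The exchange strategy can be sketched as a pairing of every density model with 
every available estimate of the joint class probabilities. The dictionary layout, 
the function name and the placeholder strings standing in for trained models are 
assumptions made for illustration. 

```python
from itertools import product

def build_ensemble(classifiers):
    """classifiers: dict mapping a name to (density_model, joint_prior_estimate).

    Returns a dict pairing every density model with every joint-prior
    estimate, i.e. the original classifiers plus the exchanged variants.
    """
    ensemble = {}
    for dens_name, prior_name in product(classifiers, repeat=2):
        density_model = classifiers[dens_name][0]
        joint_prior = classifiers[prior_name][1]
        if dens_name == prior_name:
            label = dens_name
        else:
            label = f"{dens_name} with joint probabilities from {prior_name}"
        ensemble[label] = (density_model, joint_prior)
    return ensemble

# Placeholder strings stand in for the trained models and estimated priors.
ensemble = build_ensemble({
    "ML": ("ml_densities", "ml_joint"),
    "RBF": ("rbf_densities", "rbf_joint"),
})
```

With two base classifiers this yields four ensemble members; adding further base 
classifiers enlarges the ensemble in the same way. 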



5. Experimental Results 

To assess the effectiveness of the proposed approach, different experiments were 
carried out on a data set made up of two multispectral images acquired by the 
Thematic Mapper (TM) sensor of the Landsat 5 satellite. The selected test site was a 
section (412x382 pixels) of a scene including Lake Mulargias on the Island of 
Sardinia, Italy. The two images used in the experiments were acquired in September 
1995 ($t_1$) and July 1996 ($t_2$). The available ground truth was used to derive a training 
set and a test set for each image. Five land-cover classes (i.e., urban area, forest, 
pasture, water, vineyard), which characterise the test site at the above-mentioned 
dates, were considered (see [2] for a detailed description of the data set composition). 
To carry out the experiments, we assumed that only the training set associated with 
the image acquired in September 1995 was available. 

An ML cascade classifier and an RBF neural network cascade classifier (with 50 hidden 
neurons) were applied to the September 1995 and July 1996 images. For the ML classifier, the 
assumption of Gaussian distributions was made for the density functions of classes 
(this was a reasonable assumption, as we considered TM images). In order to exploit 
the non-parametric characteristic of the RBF neural classifier, 5 texture features based 
on the Gray-Level Co-occurrence matrix [18] were given as input to this classifier in 
addition to the 6 TM channels. From the considered ML and RBF cascade classifiers, 
two further classifiers were generated by exchanging the prior joint probabilities of 
classes according to the strategy described in Section 4. The classification accuracies 
exhibited by the four considered partially unsupervised cascade classifiers on the July 
1996 test set are reported in Table 1. 

Table 1. Classification accuracy (%) exhibited by the four partially unsupervised cascade 
classifiers included in the proposed multiple classifier architecture (July 1996 test set) 

Land-cover class   ML      RBF     ML with joint probabilities   RBF with joint probabilities 
                                   computed by RBF               computed by ML 
Pasture            87.6    97.8    85.9                          96.5 
Forest             97.4    96.0    97.4                          96.4 
Urban area         94.4    97.9    94.2                          97.7 
Water              100.0   100.0   100.0                         100.0 
Vineyard           64.9    83.0    62.3                          88.1 
Overall            92.6    97.3    91.8                          97.2 


At this point, the four classifiers were combined by using both the majority voting 
and the Bayesian combination strategies. The overall accuracies obtained are given in 
Table 2, where the accuracies exhibited by the multiple classifier system proposed in 
[1] are also reported. Comparing Tables 1 and 2, one can conclude that the 
classification accuracies provided by the considered ensemble of partially 
unsupervised cascade classifiers are higher than both those obtained by the single 
classifiers composing the ensemble and those yielded by our previous system. 



6. Conclusions 

In this paper a multiple-classifier system for a partially unsupervised updating of 
land-cover maps has been proposed. The main features of the proposed system are the 
following: i) capability to exploit temporal correlation between multitemporal images; 
ii) capability to consider multisensor/multisource data in the process of updating of 
land-cover maps (thanks to the availability of non-parametric classification algorithms 
in the ensemble); iii) capability to easily define the multiple-classifier architecture to 
be adopted. In the experiments we carried out, the proposed multiple classifier system 
proved effective, providing classification accuracies higher than those exhibited by 
both the single partially unsupervised cascade classifiers composing the ensemble and 
the classification system presented in [1]. 

Future developments of this work are being pursued in two directions: 
i) extending the partially unsupervised cascade-classification approach to other kinds of 
classification techniques; ii) studying, from a theoretical perspective, the problem of 
ensemble definition in the presence of cascade-classification algorithms. 

Table 2. Overall classification accuracies exhibited by the proposed multiple classifier system 
and by the system presented in [1] 

Combination strategy     Proposed system   System proposed in [1] 
Bayesian combination     98.0%             96.5% 
Majority rule            97.8%             96.4% 

Abstract. This article focuses on the use of multiple classifier systems (MCSs) 
based on dynamic classifier selection. Four implementation strategies of MCSs 
are compared: majority voting, belief networks, and two designs based on dy- 
namic classifier selection. Experimental results indicate that the direction taken 
by Woods et al. [1] is the best alternative for remote sensing applications for 
which the classifier-dependent posterior distributions are unknown. 



1 Introduction 

In remote sensing image data interpretation, there are two main categories of error. 
The first category comprises labeling inconsistencies, which concern the representativeness 
of the training (and test) data and is due to mixed pixels (class overlap), transition 
zones, dynamic zones, within-class variability (covariance), limited training data, and 
topographic shading, just to name a few. In fact, it is due to these factors that the su- 
pervised classification of remote sensing images distinguishes itself from many other 
pattern recognition application domains: the physical parameters with which ground 
information is collected are often fundamentally different from the physical parameters 
collected by the sensor. For instance, when a human observer collects ground information 
for multispectral, hyperspectral, or microwave imagery, criteria are used that 
are not optimized for the natural clusters that characterize the data. This type of error 
is difficult to quantify. 

The second type of error, the classification-induced error, can be reduced using 
carefully defined classes and number of classes, classification schemes, and the choice 
of the feature vector. 

The subjectivity inherent in the training and testing of remote sensing image classi- 
fiers is widely recognized [2], [3]. The subjectivity is further amplified by the fact that 
training and test data are expensive, which poses limits on the quantity and sometimes 
on the quality of the training data. 

Nowadays, classifiers are understood well enough to reproduce virtually any decision
space in the feature space of a data set, which considerably reduces the influence of the
classification-induced error. In this paper it is maintained that the main power of
applying Multiple Classifier Systems (MCSs) to remote sensing image classification
is to reduce the influence of the first type of error mentioned above, namely the
labeling inconsistencies. It should be noted that if the classes separate well in the
feature space, all classifiers should return the same result, and applying an MCS is not
likely to be of any benefit.

There are two basic approaches to MCS design [1]: classifier fusion and dynamic 
classifier selection. The former approach combines in parallel the outputs of the classi- 
fiers in order to achieve some kind of "group consensus." The latter attempts to predict 
which single classifier is most likely to be correct for a given sample. 

Many realizations of MCSs exploit the posterior probability of the classifiers rather 
than the classification results themselves. Examples include the consensus theory 
which uses the source-specific posterior probability [4], and the combination of the 
Bayesian average [5], [6]. 

In operational remote sensing, Bayesian classification methods are not frequently
used, and a-posteriori probabilities are often rough estimates [4]. Huang and Suen
[7] argue that research on classifiers that output a single class label, namely the class
to which the object most probably belongs, will become most
important in handwriting recognition. We feel that a similar statement holds
for the field of remote sensing.

In this article, attention is focused on the category of MCSs that combine
different classification results and their respective confusion matrices without
precise knowledge of the classifier-specific posterior probability. The aim
of this article is to investigate the thesis that, for remote sensing applications, MCSs
using individual class performances are the most appropriate if no estimates of the
a-posteriori probability are available. The novelty of the article is that it compares four
different MCS design strategies using publicly available remote sensing data sets. The
MCSs compared are based on: 1) the majority rule; 2) belief functions; 3) dynamic
classifier selection by simple partitioning (DCS-SP); and 4) dynamic classifier
selection by local accuracy (DCS-LA).

This paper is organized as follows. Section 2 focuses on previous work done in the 
field of MCSs based on individual classifier behavior, and explains briefly the four 
MCSs mentioned in the previous paragraph. Section 3 reports results on real-world 
data. Section 4 provides a discussion and draws the conclusions based on the experi- 
ments. 

For more generic reviews on multiple classifier systems and combining classifiers 
the reader is referred to [8] and [9]. 



2 MCSs Based on Individual Classifier Behavior 

Let Z be the object to be assigned to one of the M possible classes $\omega_1, \ldots, \omega_M$.
Assume that we have K classifiers, each representing the given object by a feature vector
$x_k$, $k = 1, \ldots, K$. The output of classifier k assigned to feature vector $x_k$ is denoted
by $C_k(x_k)$. In the measurement space each class $\omega_m$ is modeled by the probability
density function $p(x_k \mid \omega_m)$ and its a-priori probability of occurrence is denoted by
$P(\omega_m)$.

It is widely agreed that the best possible way to assign class labels is according to
Bayesian theory [10]. This holds for individual classifiers as well as for
MCSs. Given the measurements $x_k$, $k = 1, \ldots, K$, the object should be assigned to
class $\omega_j$, provided the a posteriori probability of that interpretation is maximum:

$$P(\theta = \omega_j \mid x_1, \ldots, x_K) = \max_{m} P(\theta = \omega_m \mid x_1, \ldots, x_K) \qquad (1)$$

The underlying hypothesis of using individual classifier behavior in the MCS design
is that every classifier has characteristics justifying its participation in the MCS.
If a classifier that is very good for the data set under analysis (e.g., one based on
the maximum a-posteriori [MAP] criterion with appropriate knowledge built in) is applied in
combination with a mediocre classifier (e.g., one based on the minimum distance to class
mean [MD] criterion), the MCS is not likely to improve on the result of the MAP criterion.

The most straightforward way of exploiting the individual class performances of 
the combined classifiers is by means of dynamic classifier selection. Based on some 
method of partitioning the input samples, it is predicted which single classifier is most 
likely to be correct for a given sample. An example is partitioning by the set of indi- 
vidual classifier decisions [7], which we will refer to as dynamic classifier selection 
by simple partitioning (DCS-SP). In this case the feature space is partitioned based on 
the global performance of a classifier on the individual classes. 

Another important means of MCS design exploiting the knowledge of the decisions
made by classifiers on individual classes is based on belief functions. Belief
functions often derive this knowledge from the confusion matrix computed from the
training data [8]. Let $P(\theta = \omega_m \mid C_k(x_k))$ be the probabilities estimated from the
confusion matrix. Then the belief function for class label $\omega_m$ becomes:

$$\mathrm{bel}(\theta = \omega_m) = \eta \prod_{k=1}^{K} P(\theta = \omega_m \mid C_k(x_k)) \qquad (2)$$

where $\eta$ is a normalizing constant ensuring that $\sum_{m=1}^{M} \mathrm{bel}(\theta = \omega_m) = 1$. Then the MCS
assigns the class label with the highest belief value.
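The belief-function rule can be illustrated with a minimal Python sketch. It assumes that each confusion matrix is stored with true classes as rows and classifier outputs as columns, so that normalizing a column gives an estimate of the probability of each true class given that output. The function names and the two small confusion matrices are hypothetical and serve only as an illustration.

```python
import numpy as np

def class_given_output(confusion):
    """Estimate P(theta = m | C_k = j) from a confusion matrix.

    Assumed layout: confusion[m, j] counts training samples of true class m
    that the classifier labelled as class j.
    """
    confusion = np.asarray(confusion, dtype=float)
    col_sums = confusion.sum(axis=0, keepdims=True)
    # Guard against classes that the classifier never outputs.
    return confusion / np.where(col_sums == 0, 1.0, col_sums)

def belief_combine(confusions, decisions):
    """Return (label, beliefs) for one sample, given each classifier's decision."""
    beliefs = np.ones(np.asarray(confusions[0]).shape[0])
    for confusion, j in zip(confusions, decisions):
        beliefs *= class_given_output(confusion)[:, j]
    total = beliefs.sum()
    if total > 0:
        beliefs = beliefs / total  # the normalizing constant makes beliefs sum to 1
    return int(np.argmax(beliefs)), beliefs

# Hypothetical two-class example with two classifiers that disagree.
conf_a = [[8, 2], [1, 9]]
conf_b = [[6, 4], [3, 7]]
label, beliefs = belief_combine([conf_a, conf_b], decisions=[0, 1])
```

In this toy case the first classifier is the more reliable one for its output, so the combined belief follows its decision.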

In [1], an MCS is presented that uses estimates of each individual classifier's local 
accuracy in small regions of feature space surrounding an unknown test sample. The 
method, called Dynamic Classifier Selection by Local Accuracy (DCS-LA), considers 
only the output of the most locally accurate classifier. The local regions are defined in 
terms of the K-nearest neighbors in the training data. The best results were obtained 
by using the percentage of the local class accuracy as performance measure. 
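A minimal Python sketch of the selection step may help to make the idea concrete. It is written under stated assumptions and is not the implementation of [1]: the local class accuracy of a classifier is taken as the fraction of the nearest training samples that it assigned to the same class as the test sample and that truly belong to that class, and the Euclidean distance is used. The function name `dcs_la` and the toy data are hypothetical.

```python
import numpy as np

def dcs_la(x, train_X, train_y, train_preds, test_preds, k=10):
    """Pick a label for one test sample x by local class accuracy.

    train_X     : (n, d) training feature vectors
    train_y     : (n,) true training labels
    train_preds : (K, n) labels that each of the K classifiers gave the training samples
    test_preds  : (K,) labels that the K classifiers give to x
    """
    train_X = np.asarray(train_X, dtype=float)
    train_y = np.asarray(train_y)
    train_preds = np.asarray(train_preds)
    test_preds = np.asarray(test_preds)
    if np.all(test_preds == test_preds[0]):
        return int(test_preds[0])  # classifiers agree: nothing to select
    dists = np.linalg.norm(train_X - np.asarray(x, dtype=float), axis=1)
    neighbours = np.argsort(dists, kind="stable")[:k]
    best_label, best_acc = int(test_preds[0]), -1.0
    for preds_k, label in zip(train_preds, test_preds):
        # Neighbours that classifier k put in the same class as x ...
        same = neighbours[preds_k[neighbours] == label]
        # ... and the fraction of them whose true class really is that label.
        acc = float(np.mean(train_y[same] == label)) if same.size else 0.0
        if acc > best_acc:
            best_label, best_acc = int(label), acc
    return best_label

# Hypothetical 1-D data: classifier 0 is wrong near x = 0.25, classifier 1 is not.
train_X = [[0.0], [0.1], [0.2], [0.3], [5.0], [5.1]]
train_y = [0, 0, 1, 1, 1, 1]
train_preds = [[0, 0, 0, 0, 1, 1],
               [0, 0, 1, 1, 1, 1]]
chosen = dcs_la([0.25], train_X, train_y, train_preds, test_preds=[0, 1], k=4)
```

The neighbour search over the whole training set for every disputed sample is what makes this kind of selection much more expensive than a simple vote.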

In the next section the performance of the different MCSs described above will be
compared. Table 1 summarizes the MCSs.

Table 1. Overview of the MCSs used in the experiments.

MCS short name   Description                                        References
---------------  -------------------------------------------------  -------------------
Majority rule    MCS based on the simple majority rule              Giacinto et al. [6]
Belief function  Based on belief functions created from the         Xu et al. [8]
                 confusion matrix (from training set); implements
                 equation (2)
DCS-SP           Dynamic Classifier Selection by Simple             Huang and Suen [7]
                 Partitioning. Uses the average class accuracy to
                 assign each classifier to a class (see Table 2)
DCS-LA           Checks the individual class assignment of the 10   Woods et al. [1]
                 nearest samples, and selects the output of the
                 locally most accurate classifier.


3 Experiments 

The main aim of the experiments is to evaluate the different design strategies based on 
dynamic classifier selection and to compare these approaches to the much-used MCSs 
based on majority voting, and on belief functions. 

Two publicly available data sets are used. Data set A is a multi-sensor,
multispectral data set, and data set B is a hyperspectral data set. For each data set, different
statistical classifiers are defined and combined in the different MCS design strategies.

The performance of all classifiers and MCSs is measured by three parameters: the
minimum accuracy, the maximum accuracy, and the kappa value.
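As an illustration of these three parameters, the following Python sketch computes them from a confusion matrix. It assumes that the accuracies are the per-class user's accuracies (diagonal entry divided by row sum) and that kappa is Cohen's kappa coefficient; the function name `accuracy_summary` is hypothetical. The example matrix is the one reported for classifier A.4 in Table 4.

```python
import numpy as np

def accuracy_summary(confusion):
    """Return (min user's accuracy %, max user's accuracy %, kappa).

    Assumed layout: user's accuracy of a class is its diagonal entry
    divided by the sum of its row, as in the confusion matrix tables.
    """
    c = np.asarray(confusion, dtype=float)
    n = c.sum()
    users = 100.0 * np.diag(c) / c.sum(axis=1)
    p_observed = np.trace(c) / n
    p_chance = np.sum(c.sum(axis=1) * c.sum(axis=0)) / n**2
    kappa = (p_observed - p_chance) / (1.0 - p_chance)
    return users.min(), users.max(), kappa

# Confusion matrix of classifier A.4 as reported in Table 4.
table4 = [
    [1945, 45, 4, 22, 1],
    [24, 1160, 166, 5, 0],
    [21, 90, 381, 56, 0],
    [34, 17, 8, 783, 35],
    [30, 13, 0, 9, 911],
]
min_acc, max_acc, kappa = accuracy_summary(table4)
```

Rounded, this gives 69.5, 96.4, and 0.868, which agrees with the values listed for classifier A.4 in Table 3.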

3.1 Data Set A 

3.1.1 Data Set Description 

Data set A consists of Airborne Thematic Mapper imagery, co-registered with 
NASA/JPL synthetic aperture radar imagery, acquired over the agricultural area of 
Feltwell (UK). The images, 15 in total, are filtered and normalized. The set was first 
used in [11] and [12], and later in [6]. The data set contains training and test pixels 
(5124 and 5820 pixels, respectively). In the following, the different bands of this data 
set are identified by the letter b (band) and an index. 

3.1.2 Classifier Definition 

Four classifiers are built, all based on statistical image models and therefore with the
lowest possible design complexity. Classifiers A.1-A.4 relate to data set A.

Classifier A.1: Maximum likelihood (ML) classifier. The features used in classifier A.1 are
selected by a feature subset selection based on the ML criterion and a maximum allowed
mean error on the training data of 0.10 [13]. The feature subset selection based on the
training pixels of data set A resulted in a feature vector of four features,
$x_i = [\,b_{\ldots}\ \cdots\ b_{\ldots}\,]^T$, where $x_i$ is the feature vector for classifier A.1 at pixel
i, b denotes the indexed band number, and T the transpose.

Classifier A.2: Maximum likelihood (ML) classifier. The features used in classifier A.2 are
selected by a feature subset selection based on the criterion proposed by Fukunaga
([15], equation 10.5); a maximum of five features was allowed. With these constraints,
the feature subset selection based on the training pixels of data set A resulted in the
feature vector $x_i = [\,b_{\ldots}\ \cdots\ b_{\ldots}\,]^T$, where $x_i$ is the feature vector for classifier A.2
at pixel i.

Classifier A.3: Minimum distance to class mean (MD) classifier. The features used in
classifier A.3 are selected by a feature subset selection based on the divergence
criterion [15]; again, a maximum of five features was allowed. Based on the
training data in data set A, this resulted in the feature subset
$x_i = [\,b_{\ldots}\ \cdots\ b_{\ldots}\,]^T$, where $x_i$ is the feature vector for classifier A.3
at pixel i.

Classifier A.4: Classifier A.4 is based on maximum a-posteriori probability
(MAP) implemented in a Markov random field (MRF) framework, as described in
[16]. The features used in this classifier are the same as those used in classifier A.1,
$x_i = [\,b_{\ldots}\ \cdots\ b_{\ldots}\,]^T$, i.e., the image model is identical to the class conditional
density functions defined by the ML approach. The settings of the MRF-MAP
approach are: β = 2.5, a second order neighborhood system is used for the definition of
the clique interactions, and a stochastic energy optimization is used with 300
iterations.



3.1.3 Definition of the MCSs 

Table 1 summarizes the MCSs that have been implemented and tested. The belief
functions are created based on the confusion matrices of the individual classifiers
applied to the training data set.

The DCS-SP approach is implemented based on the best classifier for each class.
Table 2 gives an overview of the classifier selected for each of the five classes, based on
the classifier performance on the training data. Note that classifier A.1 is never
selected in this approach.
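The selection of the best classifier per class can be sketched in a few lines of Python using the training class accuracies listed in Table 2. The per-class selection follows directly from those numbers. The decision rule `dcs_sp`, which trusts the decision whose classifier has the highest class accuracy for the class it outputs, is only one plausible reading of the approach and should be taken as an assumption.

```python
import numpy as np

# Training class accuracies from Table 2
# (rows: classifiers A.1-A.4, columns: classes 1-5).
class_acc = np.array([
    [0.94, 0.84, 0.65, 0.92, 0.89],
    [0.95, 0.82, 0.71, 0.92, 0.92],
    [0.76, 0.33, 0.28, 0.76, 0.54],
    [0.92, 0.85, 0.62, 0.93, 0.93],
])

# Best classifier for each class (0-based index into the rows above).
best_for_class = class_acc.argmax(axis=0)

def dcs_sp(decisions, class_acc):
    """Hypothetical rule: keep the decision backed by the highest class accuracy.

    decisions[k] is the 0-based class index output by classifier k.
    """
    decisions = np.asarray(decisions)
    scores = class_acc[np.arange(len(decisions)), decisions]
    return int(decisions[scores.argmax()])

# Classifiers A.1 and A.3 say class 3, A.2 and A.4 say class 2 (0-based: 2 and 1).
label = dcs_sp([2, 1, 2, 1], class_acc)
```

With these accuracies, classifiers A.2 and A.4 are selected for all five classes, which is consistent with the remark that classifier A.1 is never selected.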

For the DCS-LA algorithm, the Euclidean distance metric has been used. Note that
the DCS-LA algorithm is applied only when classifiers do not agree. The number of
nearest neighbors used in the DCS-LA algorithm is 10 [1].

3.1.4 Results 

Table 3 reports the performances of the four individual classifiers on data set A. The
MRF-MAP approach gives the best results in terms of maximum accuracy and kappa
value. Table 4 shows the confusion matrix for classifier A.4. It is interesting to note that,
although relatively simple statistical classifiers have been used (the design complexity
[6] was 1 for all classifiers), good results have been obtained. Three of the four
classifiers on data set A have a maximum accuracy higher than the best classifier in [6].

Table 2. Example of the selection of best classifier for each partition for use in the DCS-SP
approach, based on the training class accuracies of data set A. Each entry is
P(θ in c | Ck = c) for classifier k and class c.

Classifier   Class 1   Class 2   Class 3   Class 4   Class 5
----------   -------   -------   -------   -------   -------
C1           0.94      0.84      0.65      0.92      0.89
C2           0.95      0.82      0.71      0.92      0.92
C3           0.76      0.33      0.28      0.76      0.54
C4           0.92      0.85      0.62      0.93      0.93



Table 3. Summary of the test set accuracies (user’s accuracies) and kappa values of the four
classifiers. The best performances on data set A are underlined.

Classifier   Minimum accuracy (%)   Maximum accuracy (%)   Kappa
----------   --------------------   --------------------   -----
A.1          70.4                   94.9                   0.860
A.2          76.6                   94.9                   0.865
A.3          16.1                   70.8                   0.446
A.4          69.5                   96.4                   0.868



Table 5 summarizes the results of the MCSs. The DCS-SP method outperforms the 
other MCSs in terms of maximum accuracy, but its minimum accuracy and kappa 
value are outperformed by the DCS-LA approach. Note that both the majority rule and 
the DCS-LA approaches outperform the minimum accuracy and kappa value of the 
MRF-MAP labeling. Comparing the confusion matrix of the DCS-LA result (Table 6) 
with the confusion matrix of the best individual classifier (Table 4), one can see that 
the user’s accuracy of the weakest class (class 3) is improved by more than 10 points. 
From the results on data set A we may conclude that the DCS-LA approach is the best 
for the combination of classifiers A.1-A.4. 



Table 4. Confusion matrix of classifier A.4 (MRF-MAP on data set A).

                         1.      2.      3.      4.      5.     Sum   User’s accuracy
-------------------   -----   -----   -----   -----   -----   -----   ---------------
Class 1.               1945      45       4      22       1    2017   96.4
Class 2.                 24    1160     166       5       0    1355   85.6
Class 3.                 21      90     381      56       0     548   69.5
Class 4.                 34      17       8     783      35     877   89.3
Class 5.                 30      13       0       9     911     963   94.6
Sum                    2054    1325     559     875     947    5760
Producer’s accuracy    94.7    87.6    68.2    89.5    96.2

Table 5. The performances of the MCSs on the test set A.

MCS                min. accuracy [%]   max. accuracy [%]   Kappa [-]
----------------   -----------------   -----------------   ---------
Majority rule      76.6                95.3                0.870
Belief functions   59.7                95.1                0.853
DCS-SP             69.9                95.9                0.861
DCS-LA             81.4                95.1                0.873



Table 6. Confusion matrix of the DCS-LA result, based on input from classifiers A.1-A.4.

                         1.      2.      3.      4.      5.     Sum   User’s accuracy
-------------------   -----   -----   -----   -----   -----   -----   ---------------
Class 1.               1919      55       3      34       6    2017   95.1
Class 2.                  8    1175     162       7       3    1355   86.7
Class 3.                 20      23     446      57       2     548   81.4
Class 4.                 31      21       8     785      32     877   89.5
Class 5.                 36      20       0      29     878     963   91.2
Sum                    2014    1294     619     912     921    5760
Producer’s accuracy    95.3    90.8    72.1    86.1    95.3







3.2 Data Set B 

3.2.1 Data Set Description 

Data set B is the hyperspectral data set that comes with the documentation of the pub- 
licly available application software MultiSpec [17]. It consists of 220 bands, and the 
ground truth samples of 16 classes. For the experiments reported here, the ground truth
samples have been split into a training and a test set.

3.2.2 Classifier Definition 

Classifier B.1: Maximum likelihood (ML) classifier. The features used in classifier
B.1 are selected by a feature subset selection based on the ML criterion and a maximum
allowed mean error on the training data of 0.15. This feature subset selection,
based on the training pixels of data set B, resulted in a feature vector of 14 features,
corresponding to bands 143, 168, 29, 71, 35, 16, 42, 198, 31, 60, 20, 123, 133, and
131.

Classifier B.2: Maximum likelihood (ML) classifier. The features used in classifier
B.2 are selected by a feature subset selection based on the criterion proposed by
Fukunaga ([16], equation 10.5); a maximum of 13 features was allowed. With these
constraints, the feature subset selection based on the training pixels of data set B
resulted in a feature vector with bands 167, 10, 102, 140, 181, 41, 52, 37, 122, 17, 60,
29, and 142.

Classifier B.3: Like classifier A.4, classifier B.3 is based on the MRF-MAP
approach. The features used in this classifier are the same as those used in classifier
B.1. The settings of the MRF-MAP approach are: β = 2.5, a second order neighborhood
system is used for the definition of the clique interactions, and a stochastic
energy optimization is used with 300 iterations.

3.2.3 Definition of the MCSs 

On data set B, the two best MCSs identified in subsection 3.1 are compared: the
majority voting approach and the DCS-LA approach (see Table 1). Both MCSs use the
three classifiers B.1-B.3.

Like the DCS-LA used on classifiers A.1-A.4, the DCS-LA uses 10 samples to 
determine the locally most accurate classifier. 

3.2.4 Results Set B 

Table 7 reports the performances of the three individual classifiers on data set B. Also
here, the MRF-MAP approach gives the best results.

Table 8 summarizes the results of the MCS based on the majority rule and on the 
DCS-LA approach. 

In the case of data set B the minimum accuracy does not change when combining 
the classifiers. This is due to the very low number of test pixels available for class 9 
(see also the confusion matrix in Table 9). For data set B the kappa value is considered 
more relevant. 



Table 7. Summary of the test set user’s accuracies and kappa values of the three classifiers
defined for data set B. The best performances on data set B are underlined.

Classifier   Minimum accuracy (%)   Maximum accuracy (%)   Kappa
----------   --------------------   --------------------   -----
B.1          33.3                   100.0                  0.765
B.2          33.3                   99.3                   0.715
B.3          33.3                   100.0                  0.877



Table 8. The performances of the MCSs on the test set B based on B.1, B.2, and B.3. The
DCS-LA computes the distance measure from the bands that are used in B.1-B.3.

MCS             min. accuracy [%]   max. accuracy [%]   Kappa [-]
-------------   -----------------   -----------------   ---------
Majority rule   33.3                100.0               0.829
DCS-LA 220      33.3                100.0               0.847



4 Discussion and Conclusions 

MCSs are used to approximate an ideal Bayesian classifier more closely than the individual
classifiers applied separately. It is widely agreed that Bayesian MAP labeling of
images provides the most powerful approach to image analysis currently available
[10]. The only way in which this approach can be improved is by incorporating
application-specific prior knowledge into the analysis problem. This knowledge can take
various forms, and the use of data fusion principles that incorporate existing and
previously computed information with newly acquired data in a Bayesian framework
seems to be the most promising direction to continue.

Here, it is felt that MCSs have the potential to approach the MAP approach in the
absence of precise prior knowledge. In this paper, MCS design strategies based on
majority voting, belief networks, DCS-SP, and DCS-LA were compared. The
experiments reported in this paper indicate that the DCS-LA described by Woods et al. [1] is
the preferred one among the tested MCS design strategies. However, better
benchmarking is desired to confirm these findings.

Also, it should be noted that DCS-LA needs about 12 minutes for the Feltwell data
set on a 450 MHz personal computer, against less than a second for the majority rule or
the belief networks. Users should therefore be aware that the improvement in
accuracy may have a considerable price in terms of computing time.

available online via http://dynamo.ecn.purdue.edu/~biehl/MultiSpec/. Part of the
results have been generated with the freely available software application program
Resima (http://www.resima.com/).



Table 9. Confusion matrix of the DCS-LA result, based on input from classifiers B.1, B.2, B.3.

Interested readers may contact the author for the software (Visual C++ project) that
implements the MCSs discussed in this paper.

Abstract. The need to optimize the classification accuracy of remotely 
sensed imagery has led to an increasing use of Earth observation data 
with different characteristics collected from a variety of sensors from 
different parts of the electromagnetic spectrum. Combining multisource 
data is believed to offer enhanced capabilities for the classification of 
target surfaces. In the paper several single and multiple classifiers which 
are appropriate for classification of multisource remote sensing and ge- 
ographic data are considered. The focus is on multiple classifiers: bag- 
ging algorithms, boosting algorithms, and consensus theoretic classifiers. 
These multiple classifiers have different characteristics. The performance 
of the algorithms in terms of accuracies is compared for a multisource 
remote sensing and geographic data set. 



1 Introduction 

Traditionally, in pattern recognition, a single classifier is used to determine which 
class a given pattern belongs to. In many cases, classification accuracy can be 
improved by using an ensemble of classifiers in the classification. In such a case 
it is possible to have the individual classifiers support each other in making a 
decision. The aim is to determine an effective combination method which uses 
the benefits of each classifier but avoids the weaknesses. 

In this paper, three multiple classifiers are investigated in terms of classifica- 
tion of multisource remote sensing and geographic data. The paper is organized 
as follows. First, multiclassifier systems are discussed with a special emphasis on 
the recently proposed bagging and boosting algorithms, and statistical consen- 
sus theory. Experimental results for a multisource remote sensing and geographic 
data set are given in Section 3. Finally, conclusions are drawn. 



2 Multiclassifier Systems 

Several methods have been proposed to combine multiple classifiers [1-3]. 
Wolpert [1] introduced the general method of stacked generalization where out- 
puts from classifiers are combined in a weighted sum with weights which are 



J. Kittler and F. Roli (Eds.): MCS 2001, LNCS 2096, pp. 279 ff., 2001.
© Springer-Verlag Berlin Heidelberg 2001






G.J. Briem, J.A. Benediktsson, and J.R. Sveinsson 



based on the individual performance of the classifiers. Tumer and Ghosh [2] have 
also shown that substantial improvements can be achieved in difficult pattern 
recognition problems by combining or integrating the outputs of multiple clas- 
sifiers. Benediktsson et al. combined classifiers using neural networks [4-5] and 
improved their overall accuracies as compared to the best results of the single 
classifiers involved in the classification. 

Multiclassifier systems [6-7] have been used since the sixties. Currently, two of 
the most used multiclassification approaches are boosting [8-9] and bagging [10]. 
Both these approaches are based on manipulating training samples. In contrast, 
statistical consensus theory is based on independence between data sources and 
uses all the training data only once. All three approaches are discussed briefly 
below. 



2.1 Boosting 

Boosting is a general method which is used to increase the accuracy of any
classifier. Several versions of boosting have been proposed, but we will concentrate
on AdaBoost [9], which was proposed in 1995. In particular, we will use the
AdaBoost.M1 method, which can be used on classification problems with more
than two classes. A version of the AdaBoost algorithm is shown below.



Input: A training set S with m samples, base classifier I and number of
classifiers T.

1. $S_1 = S$ and weight$(x_j) = 1$ for $j = 1 \ldots m$ ($x_j \in S_1$)
2. For $i = 1$ to $T$ {
3.   $C_i = I(S_i)$
4.   $\epsilon_i = \frac{1}{m} \sum_{x_j \in S_i:\, C_i(x_j) \neq y_j} \mathrm{weight}(x_j)$
5.   If $\epsilon_i > 0.5$, abort!
6.   $\beta_i = \epsilon_i / (1 - \epsilon_i)$
7.   For each $x_j \in S_i$ { if $C_i(x_j) = y_j$ then weight$(x_j)$ = weight$(x_j) \cdot \beta_i$ }
8.   Norm weights such that the total weight of $S_i$ is m.
9. }
10. $C^*(x) = \arg\max_{y \in Y} \sum_{i:\, C_i(x) = y} \log \frac{1}{\beta_i}$

Output: The multiple classifier $C^*$.
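The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the base classifier is a hypothetical weighted decision stump on one-dimensional inputs, the loop simply stops when the weighted error exceeds 0.5, and a zero-error classifier ends the loop with a large vote because the weight update is undefined in that case.

```python
import math
from collections import defaultdict

def train_stump(X, y, w):
    """Weighted decision stump on 1-D inputs (hypothetical base classifier I)."""
    labels = sorted(set(y))
    best = None
    for t in sorted(set(X)):
        for left in labels:
            for right in labels:
                err = sum(wj for xj, yj, wj in zip(X, y, w)
                          if (left if xj <= t else right) != yj)
                if best is None or err < best[0]:
                    best = (err, t, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right

def adaboost_m1(X, y, T, train=train_stump):
    m = len(X)
    w = [1.0] * m                                   # step 1
    ensemble = []                                   # (classifier, beta) pairs
    for _ in range(T):                              # step 2
        clf = train(X, y, w)                        # step 3
        eps = sum(wj for xj, yj, wj in zip(X, y, w)
                  if clf(xj) != yj) / m             # step 4
        if eps > 0.5:                               # step 5
            break
        if eps == 0.0:
            # Assumption: a perfect classifier ends the loop with a large vote,
            # because beta = 0 would make log(1/beta) undefined.
            ensemble.append((clf, 1e-10))
            break
        beta = eps / (1.0 - eps)                    # step 6
        w = [wj * beta if clf(xj) == yj else wj
             for xj, yj, wj in zip(X, y, w)]        # step 7
        total = sum(w)
        w = [wj * m / total for wj in w]            # step 8
        ensemble.append((clf, beta))
    if not ensemble:
        raise ValueError("base classifier error exceeded 0.5 in the first round")

    def predict(x):                                 # step 10
        votes = defaultdict(float)
        for clf, beta in ensemble:
            votes[clf(x)] += math.log(1.0 / beta)
        return max(votes, key=votes.get)
    return predict

# Toy problem that no single stump can solve: class 1 lies between two runs of class 0.
X = [1, 2, 3, 4, 5, 6]
y = [0, 0, 1, 1, 0, 0]
boosted = adaboost_m1(X, y, T=3)
single = adaboost_m1(X, y, T=1)
```

On this toy problem a single stump labels every sample as class 0, whereas three boosting rounds reproduce all six training labels.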



In the beginning of AdaBoost, all patterns have the same weight and the
classifier $C_1$ is the same as the base classifier. If the classification error is greater
than 0.5, then the method does not work and the procedure is usually stopped
(in failure). A demand is therefore made on the minimum accuracy of the base
classifier, which can be a considerable disadvantage in multiclass problems.
Iteration by iteration, the weight of the samples which are correctly classified goes
down. Therefore, the algorithm starts concentrating on the difficult samples. At
the end of the procedure, T weighted training sets and T base classifiers have
been generated.

The main advantage of AdaBoost is that in many cases it increases the overall
accuracy of the classification. Many practical classification problems include
samples which are not equally difficult to classify, and AdaBoost is suitable for
such problems. AdaBoost tends to exhibit virtually no overfitting when the data
is noiseless. Another advantage of boosting is that the algorithm has a
tendency to reduce both the variance and the bias of the classification. On the other
hand, AdaBoost is computationally more demanding than other, simpler methods.
Therefore, it depends on the classification problem whether it is more
valuable to get increased classification accuracy or to obtain a simple and fast
classifier. Another problem with AdaBoost is that it usually does not perform
well in terms of accuracies when there is noise in the data.



2.2 Bagging 

Bagging is an abbreviation of bootstrap aggregating. Bootstrap methods are 
based on randomly and uniformly collecting m samples with replacement from a 
sample set of size m. The bagging algorithm was proposed in 1994 [10] and con- 
structs many different bags by performing bootstrapping iteratively, classifying 
each bag, and computing some type of an average of the classifications of each 
sample via a vote. Bagging is in some ways similar to boosting since both meth- 
ods design a collection of classifiers and combine their conclusions with a vote. 
However, the methods are different: because bagging always uses resampling
instead of reweighting, it does not change the distribution of the samples (does
not weight them), and all classifiers in the bagging algorithm have equal weights
during the voting. It is also noteworthy that bagging can be done in parallel, 
i.e., it is possible to design all the bags at once. On the other hand boosting is 
always done in series, and each sample set is based on the latest weights. The 
bagging algorithm can be written as: 

Input: A training set S with m samples, base classifier I and number of
bootstrapped sets T.

1. For $i = 1$ to $T$ {
2.   $S_i$ = bootstrapped bag from S
3.   $C_i = I(S_i)$
4. }
5. $C^*(x) = \arg\max_{y \in Y} \sum_{i:\, C_i(x) = y} 1$

Output: The multiple classifier $C^*$.
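The bagging steps can be sketched in Python as follows. The sketch is only an illustration: the base classifier is a hypothetical 1-nearest-neighbour rule on one-dimensional inputs, chosen for brevity rather than because it is a good candidate for bagging, and ties in the vote are broken by random selection among the tied classes.

```python
import random
from collections import Counter

def train_1nn(X, y):
    """Hypothetical base classifier I: 1-nearest neighbour on 1-D inputs."""
    pairs = list(zip(X, y))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def bagging(X, y, T, train=train_1nn, seed=0):
    rng = random.Random(seed)
    m = len(X)
    classifiers = []
    for _ in range(T):                                   # step 1
        idx = [rng.randrange(m) for _ in range(m)]       # step 2: m draws with replacement
        classifiers.append(train([X[j] for j in idx],
                                 [y[j] for j in idx]))   # step 3

    def predict(x):                                      # step 5: unweighted vote
        votes = Counter(clf(x) for clf in classifiers)
        top = max(votes.values())
        winners = sorted(label for label, n in votes.items() if n == top)
        return rng.choice(winners)                       # random choice among tied classes
    return predict

# Two well separated groups of samples.
X = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
y = [0, 0, 0, 1, 1, 1]
bagged = bagging(X, y, T=25)
```

Because each bag is drawn independently, the loop over the bags could also be run in parallel, in contrast to the sequential weight updates of boosting.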

From the above it can be seen that bagging is a very simple algorithm. A simple
majority vote is used, but if more than one class jointly receives the maximum
number of votes, then the winner is selected using some simple mechanism, e.g.
random selection. For a particular bag $S_i$ the probability that a sample from S is
selected at least once in m tries is $1 - (1 - 1/m)^m$. For a large m the probability
is approximately $1 - 1/e \approx 0.632$, indicating that each bag only includes about
63.2% of the samples in S. If the base classifier is unstable, that is, when a
small change in training samples can result in a large change in classification
accuracy, then bagging can improve the classification accuracy significantly. If
the base classifier is stable, e.g., a k-NN classifier, then bagging can reduce the
classification accuracy because each classifier receives less of the training data.
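The inclusion probability quoted above is easy to check numerically. The short Python sketch below evaluates it for a few bag sizes and compares it with the limiting value; the function name `inclusion_probability` is hypothetical.

```python
import math

def inclusion_probability(m):
    """Probability that a given sample appears at least once in a bag of m draws."""
    return 1.0 - (1.0 - 1.0 / m) ** m

limit = 1.0 - 1.0 / math.e

for m in (10, 100, 10000):
    print(m, round(inclusion_probability(m), 4), round(limit, 4))
```

The probability decreases towards the limit as m grows, so even for moderate training set sizes roughly a third of the samples are missing from each bag.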

The main advantage of the bagging algorithm is that it can increase the 
classification accuracy significantly if the base classifier is well selected. The 
bagging algorithm is also not very sensitive to noise in the data. The algorithm 
uses the instability of its base classifier in order to improve the classification 
accuracy. Therefore, it is of great importance to select the base classifier carefully. 
This is also the case for boosting since it is sensitive to small changes in the 
input signal. Bagging reduces the variance of the classification (just as boosting 
does) but in contrast to boosting bagging has little effect on the bias of the 
classification. 

2.3 Consensus Theory 

Consensus theory is not based on manipulating the training data like bagging and 
boosting. Consensus theory [4,11] involves general procedures with the goal of 
combining single probability distributions to summarize estimates from multiple 
experts with the assumption that the experts make decisions based on Bayesian 
decision theory. The combination formula obtained is called a consensus rule. The 
consensus rules are used in classification by applying a maximum rule, i.e., the 
summarized estimate is obtained for all the information classes and the pattern 
X is assigned to the class with the highest summarized estimate. Probably, the 
most commonly used consensus rule is the linear opinion pool (LOP) which is 
based on a weighted linear combination of the posterior probabilities from each 
data source. Another consensus rule, the logarithmic opinion pool (LOGP), is 
based on the weighted product of the posterior probabilities. The LOGP differs 
from the LOP in that it is unimodal and less dispersed. Also, the LOGP treats 
the data sources independently. 
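The two consensus rules can be illustrated with a minimal Python sketch. It assumes that each data source supplies a vector of posterior probabilities over the information classes and that one weight is given per source; the renormalization in the logarithmic pool is added here only for readability and does not affect the maximum rule. The function names and the numbers are hypothetical.

```python
import numpy as np

def linear_opinion_pool(posteriors, weights):
    """Weighted sum of per-source posteriors; posteriors has shape (sources, classes)."""
    p = np.asarray(posteriors, dtype=float)
    w = np.asarray(weights, dtype=float)
    return w @ p

def logarithmic_opinion_pool(posteriors, weights):
    """Weighted product of per-source posteriors, computed in the log domain."""
    p = np.asarray(posteriors, dtype=float)
    w = np.asarray(weights, dtype=float)
    log_pool = w @ np.log(np.clip(p, 1e-12, None))  # clip avoids log(0)
    pooled = np.exp(log_pool - log_pool.max())
    return pooled / pooled.sum()  # renormalized so that the estimates sum to 1

# Hypothetical posteriors from two data sources for three information classes.
posteriors = [[0.7, 0.2, 0.1],
              [0.1, 0.5, 0.4]]
weights = [0.5, 0.5]

lop = linear_opinion_pool(posteriors, weights)
logp = logarithmic_opinion_pool(posteriors, weights)
lop_class = int(lop.argmax())    # maximum rule applied to the LOP
logp_class = int(logp.argmax())  # maximum rule applied to the LOGP
```

In this example the two rules select different classes: the product in the LOGP penalizes the first class because one source gives it a very low posterior, whereas the sum in the LOP does not.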

The weighting schemes in consensus theory should reflect the goodness of 
the input data. The simplest approach is to give all the data sources equal 
weights. Also, reliability measures which rank the data sources according to 
their goodness can be used as a basis for heuristic weighting [4]. Furthermore, 
the weights can be chosen to not only weight the individual sources but also the 
individual classes. For such a scheme both linear and nonlinear optimization can 
be used. 

3 Experimental Results 

The multiple classifiers (bagging, boosting, and consensus theoretic classifiers) 
were compared in experiments. The data used in the experiments, the Anderson 
River data set, are a multisource remote sensing and geographic data set made 
available by the Canada Centre for Remote Sensing (CCRS) [12]. This data set 
is very difficult to classify [4]. 

Six data sources were used: 

1. Airborne Multispectral Scanner (AMSS) with 11 spectral data channels (10 
channels from 380 to 1100 nm and 1 channel from 8 to 14 μm). 

2. Steep Mode Synthetic Aperture Radar (SAR) with 4 data channels (X-HH, 
X-HV, L-HH, L-HV). 

3. Shallow Mode SAR with 4 data channels (X-HH, X-HV, L-HH, L-HV). 

4. Elevation data (1 data channel, where elevation in meters = 61.996 + 7.2266 
* pixel value). 

5. Slope data (1 data channel, where slope in degrees = pixel value). 

6. Aspect data (1 data channel, where aspect in degrees = 2 * pixel value). 

Table 1. Training and Test Samples for Information Classes in the Experiment on the 
Anderson River Data. 

Class #  Information Class                      Training Size  Test Size
-------  -------------------------------------  -------------  ---------
1        Douglas Fir (31-40m)                             971       1250
2        Douglas Fir (21-30m)                             551        817
3        Douglas Fir + Other Species (31-40m)             548        701
4        Douglas Fir + Lodgepole Pine (21-30m)            542        705
5        Hemlock + Cedar (31-40m)                         317        405
6        Forest Clearings                                1260       1625
Total                                                    4189       5503
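
The pixel-value relations quoted for the three topographic sources can be written as a small sketch. The function names are hypothetical, and the elevation relation is assumed to read 61.996 + 7.2266 × pixel value.

```python
def elevation_m(pixel_value):
    """Elevation in meters from the elevation channel (assumed linear relation)."""
    return 61.996 + 7.2266 * pixel_value

def slope_deg(pixel_value):
    """Slope in degrees equals the pixel value."""
    return float(pixel_value)

def aspect_deg(pixel_value):
    """Aspect in degrees is twice the pixel value."""
    return 2.0 * pixel_value
```

Under these relations, a pixel value of 0 in the elevation channel corresponds to 61.996 m, and an aspect pixel value of 90 corresponds to 180 degrees.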

There are 19 information classes in the ground reference map provided by 
CCRS. In the experiments, only the six largest ones were used, as listed in Table 
1. Here, training samples were selected uniformly, giving 10% of the total sample 
size. All other known samples were then used as test samples. To obtain baseline 
results for the multiple classifiers, several single classifiers were applied to the 
data. Two conventional statistical methods were used to classify the data: the 
MED and the Gaussian maximum likelihood method (ML) [13]. A conjugate 
gradient backpropagation (CGBP) algorithm [4] with two and three layers was 
also trained on the data with different numbers of hidden neurons (0, 15, 30, and 
45 hidden neurons). Each version of the CGBP network was trained six times 
with different initializations and the overall average accuracies were computed 
in each case. These results were also compared with the base classifiers which were used for 
bagging and boosting. These base classifiers were a decision table [14] and the J48 
decision tree [15], which is a version of the C4.5 decision tree [16] frequently used 
in pattern recognition. The results of these classifications are shown in Tables 2 
(training) and 3 (test). 

In Tables 2 and 3, the conventional classification methods, the MED and the 
ML, showed different characteristics. The MED was not acceptable in terms of 
classification accuracies, but the ML accuracies were relatively good, especially 
considering that the data are clearly not Gaussian [12]. The J48 method 
outperformed all other methods in terms of both training and test accuracies and 
achieved an overall accuracy for test data of 70.8%. The test accuracy of the 
decision table was somewhat lower than that of the CGBP neural network, which 
achieved a test accuracy of 68.8%. 

Table 2. Training Accuracies in Percentage for the Single Classifiers Applied to the 
Anderson River Data Set. 

Method                    Class 1  Class 2  Class 3  Class 4  Class 5  Class 6   Average   Overall
                                                                                Accuracy  Accuracy
------------------------  -------  -------  -------  -------  -------  -------  --------  --------
MED                          40.4      8.9     47.6     67.7     42.3     72.4      46.6      50.5
ML                           54.6     31.6     87.8     90.9     81.4     73.3      69.9      68.2
Decision Table               78.7     59.3     76.8     70.8     75.7     83.4      74.1      76.1
j48                          93.9     91.5     93.8     93.9     96.2     97.0      94.4      94.7
CGBP (30 hidden neurons)     72.2     34.4     67.2     74.6     79.2     83.1      68.4      70.7
Number of Samples             971      551      548      542      317     1260                4189



Table 3. Test Accuracies in Percentage for the Single Classifiers Applied to the 
Anderson River Data Set. 

Method                    Class 1  Class 2  Class 3  Class 4  Class 5  Class 6   Average   Overall
                                                                                Accuracy  Accuracy
------------------------  -------  -------  -------  -------  -------  -------  --------  --------
MED                          39.7      8.9     48.4     70.2     46.0     71.7      47.5      50.8
ML                           50.8     27.7     84.5     81.9     73.8     72.0      64.3      65.1
Decision Table               73.8     42.4     66.5     61.7     72.8     77.0      65.7      67.5
j48                          71.2     47.4     69.2     72.3     74.8     81.2      69.4      70.8
CGBP (30 hidden neurons)     71.9     29.3     67.5     73.8     79.3     82.4      67.4      68.8
Number of Samples            1250      817      701      705      405     1625                5503



3.1 Consensus Theory 

For the LOP and LOGP, six data classes (corresponding to the information 
classes in Table 1) were defined in each data source. The AMSS and SAR data 
sources were modeled as Gaussian, but the topographic data sources were 
modeled by Parzen density estimation with Gaussian kernels. For the non-linear 
versions, two and three layer CGBP neural networks were utilized with different 
numbers of hidden neurons (0, 15, 25, 35, and 45 hidden neurons). As in 
experiment 1, the neural networks were trained six times with different initializations. 
Then, the average of these six experiments was computed. The overall 
classification accuracies for the different consensus theoretic methods are summarized 
in Tables 4 (training) and 5 (test). In the tables the average result for the best 
implementation of the methods which are based on the neural networks is shown 
in each case. The following heuristic weights were used for the LOP: AMSS: 1.0, 
SAR Steep Mode Data: 0.8, SAR Shallow Mode Data: 0.8, Elevation data: 1.0, 
Slope data: 1.0, and Aspect data: 1.0. The heuristic weights for the LOGP were: 
AMSS: 1.0, SAR Steep Mode Data: 1.0, SAR Shallow Mode Data: 1.0, Elevation 
data: 0.0, Slope data: 0.0, and Aspect data: 0.0. A pseudo inverse method [4] 
was used as the optimal linear weighting for the consensus theoretic methods. 

Table 4. Training Accuracies in Percentage for the Consensus Theoretic Classifiers 
Applied to the Anderson River Data Set. 

Method                         Class 1  Class 2  Class 3  Class 4  Class 5  Class 6   Average   Overall
                                                                                     Accuracy  Accuracy
-----------------------------  -------  -------  -------  -------  -------  -------  --------  --------
LOP (equal weights)               49.6      0.0      0.0     51.5      0.0     94.9      32.7      47.6
LOP (heuristic weights)           68.2      0.0      0.0     73.1     24.3     89.4      42.5      54.0
LOP (optimal linear weights)      69.8     42.7     81.2     77.5     70.4     78.9      70.1      71.5
LOP (optimized with CGBP)         69.0     45.0     81.3     76.9     85.0     78.4      72.6      71.8
LOGP (equal weights)              68.7     28.1     79.6     78.8     81.7     74.3      68.5      68.8
LOGP (heuristic weights)          68.9     33.2     78.5     79.5     75.7     75.8      68.6      69.4
LOGP (optimal linear weights)     71.9     40.3     79.7     75.1     82.0     79.1      71.4      72.1
LOGP (optimized with CGBP)        81.2     56.0     84.3     88.7     91.7     86.4      81.4      81.6
Number of Samples                  971      551      548      542      317     1260                4189

Table 5. Test Accuracies in Percentage for the Consensus Theoretic Classifiers Applied 
to the Anderson River Data Set. 

Method                         Class 1  Class 2  Class 3  Class 4  Class 5  Class 6   Average   Overall
                                                                                     Accuracy  Accuracy
-----------------------------  -------  -------  -------  -------  -------  -------  --------  --------
LOP (equal weights)               49.8      0.0      0.0     50.4      0.0     95.3      32.6      45.8
LOP (heuristic weights)           68.9      0.0      0.0     73.1     20.8     89.3      42.0      53.9
LOP (optimal linear weights)      66.4     34.3     78.5     74.8     72.6     79.5      67.7      68.6
LOP (optimized with CGBP)         67.1     36.7     77.3     75.1     83.4     77.6      69.5      69.2
LOGP (equal weights)              67.9     23.1     77.8     77.5     81.2     73.7      66.9      66.4
LOGP (heuristic weights)          69.0     31.8     75.9     78.6     75.6     75.1      67.6      68.6
LOGP (optimal linear weights)     68.6     32.4     75.2     71.2     81.7     80.1      68.2      68.7
LOGP (optimized with CGBP)        75.4     43.1     76.9     79.5     87.2     82.1      74.0      74.1
Number of Samples                 1250      817      701      705      405     1625                5503

From the results in Tables 4 and 5 it is clear that the LOGP optimized 
with a neural network outperformed all other consensus theoretic methods in 
terms of overall and average training and test accuracies. It is noteworthy that 
the CGBP optimization increased the overall accuracies of the equally weighted 
LOGP by approximately 12% (training) and 6% (test), and the LOGP with 
non-linearly optimized weights easily outperformed the best single stage neural 
network classifiers both in terms of training and test accuracies. In contrast, the 
CGBP optimized LOP only gave results comparable to the single stage CGBP 
with 30 hidden neurons. However, the best CGBP optimized LOP results were 
achieved with 0 hidden neurons, whereas the best CGBP optimized LOGP results 
were reached with 45 hidden neurons. These results are not surprising: the LOP 
is a linear combination of posterior probabilities, whereas the LOGP is non-linear. 



3.2 Bagging 

Both bagging and boosting were run using the WEKA software provided by 
the University of Waikato, New Zealand [15]. In the case of bagging, 100 
iterations were selected for the j48, and 30 iterations for the decision table. In both 
cases, the maximum test accuracy for the given base classifier seemed to have 
been reached. The results for bagging are shown in Tables 6 (training) and 7 
(test). As can be seen from these tables, the bagging algorithm improved on 
the best training or test results given by LOGP when the j48 base classifier is 
used. This result comes as no surprise since decision tree classifiers are typical 
unstable classifiers which should perform well in classification by the bagging 
algorithm. Bagging based on the decision table does almost as well in terms of 
test accuracies, and in fact, slightly better than j48 after 30 iterations. 

Table 6. Training Accuracies in Percentage for the Bagging Method Applied to 
the Anderson River Data Set. 

Base Classifier    Class 1  Class 2  Class 3  Class 4  Class 5  Class 6   Average   Overall
                                                                         Accuracy  Accuracy
-----------------  -------  -------  -------  -------  -------  -------  --------  --------
Decision Table        97.6     91.5     97.6     96.9    100.0     99.0      97.1      97.3
j48                   98.7     96.2     97.4     98.3     99.4     99.3      98.2      98.4
Number of Samples      971      551      548      542      317     1260                4189



Table 7. Test Accuracies in Percentage for the Bagging Method Applied to the 
Anderson River Data Set. 

Base Classifier    Class 1  Class 2  Class 3  Class 4  Class 5  Class 6   Average   Overall
                                                                         Accuracy  Accuracy
-----------------  -------  -------  -------  -------  -------  -------  --------  --------
Decision Table        80.7     48.2     82.9     77.3     89.6     85.5      77.4      77.8
j48                   80.0     51.2     81.3     79.6     86.4     87.5      77.7      78.5
Number of Samples     1250      817      701      705      405     1625                5503
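
The bagging procedure used in this subsection can be sketched in a few lines. This is not the WEKA implementation: the base learner below is a hypothetical one-feature threshold rule standing in for J48 or the decision table, and the toy data, the seed, and all function names are assumptions made for illustration. Each round trains the base learner on a bootstrap sample drawn with replacement, and the ensemble predicts by plurality vote.

```python
import random
from collections import Counter

def train_stump(X, y):
    """Fit a one-feature threshold rule (hypothetical stand-in for the base classifier)."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = (sum(lab != l_lab for lab in left)
                      + sum(lab != r_lab for lab in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:
        # Degenerate bootstrap sample (no valid split): predict the majority class.
        majority = Counter(y).most_common(1)[0][0]
        return (0, float("inf"), majority, majority)
    return best[1:]

def predict_stump(model, row):
    f, t, l_lab, r_lab = model
    return l_lab if row[f] <= t else r_lab

def bagging_fit(X, y, n_rounds, rng):
    """Train one base classifier per bootstrap sample."""
    n = len(X)
    models = []
    for _ in range(n_rounds):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n items with replacement
        models.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models

def bagging_predict(models, row):
    """Plurality vote over the base classifiers."""
    votes = Counter(predict_stump(m, row) for m in models)
    return votes.most_common(1)[0][0]

# Toy two-class data (an assumption for this sketch).
X = [[0.1, 1.0], [0.2, 0.8], [0.3, 0.9], [0.7, 0.2], [0.8, 0.1], [0.9, 0.3]]
y = [0, 0, 0, 1, 1, 1]
models = bagging_fit(X, y, n_rounds=25, rng=random.Random(0))
```

Because every base classifier is trained independently of the others, bagging has no accuracy requirement on the individual rounds and the rounds can be run in any order.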



3.3 Boosting 

For boosting, the AdaBoost.M1 algorithm was selected, with 100 iterations for j48. After 
19 iterations the boosting of the decision table aborted. This demonstrates how 
strict the demand for 50% accuracy is for multiclass problems. The results for 
AdaBoost.M1 are shown in Tables 8 (training) and 9 (test). As can be seen 
from these results, the AdaBoost.M1 algorithm improved on the results given by 
bagging in the case of the j48 classifier, but not the decision table. The j48 base 
classifier results are outstanding. These results are the best accuracies achieved 
for the whole experiment. It is of interest to note that the best test accuracies 
were achieved 95 iterations after the training accuracy reached 100%. 
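
The abort behaviour described above follows from the structure of AdaBoost.M1, which can be sketched as follows. This is a minimal illustration and not the WEKA code: the weak learner is a hypothetical weighted threshold rule, the stop at a weighted error of 0.5 or more is an assumed convention (some formulations use a strict inequality), and the handling of a zero-error round is likewise an assumption.

```python
import math

def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label

def train_weighted_stump(X, y, w):
    """Hypothetical weak learner: best single-threshold rule under sample weights w."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            mass = {True: {}, False: {}}  # label weight on each side of the split
            for row, lab, wi in zip(X, y, w):
                side = mass[row[f] <= t]
                side[lab] = side.get(lab, 0.0) + wi
            if not mass[True] or not mass[False]:
                continue
            stump = (f, t, max(mass[True], key=mass[True].get),
                     max(mass[False], key=mass[False].get))
            err = sum(wi for row, lab, wi in zip(X, y, w)
                      if predict_stump(stump, row) != lab)
            if best is None or err < best[0]:
                best = (err, stump)
    return best

def adaboost_m1(X, y, n_rounds):
    """Return a list of (vote weight, stump); may stop before n_rounds."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        fitted = train_weighted_stump(X, y, w)
        if fitted is None:
            break
        err, stump = fitted
        if err >= 0.5:
            break  # weak learner misses the 50% requirement: boosting aborts
        perfect = err == 0
        err = max(err, 1e-10)  # keep the vote weight finite (assumption)
        beta = err / (1.0 - err)
        ensemble.append((math.log(1.0 / beta), stump))
        if perfect:
            break
        # Down-weight correctly classified samples, then renormalise.
        w = [wi * (beta if predict_stump(stump, row) == lab else 1.0)
             for row, lab, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, row):
    votes = {}
    for alpha, stump in ensemble:
        label = predict_stump(stump, row)
        votes[label] = votes.get(label, 0.0) + alpha
    return max(votes, key=votes.get) if votes else None

# Four classes, one sample each: a two-leaf rule can be right on at most half.
X4, y4 = [[0.0], [1.0], [2.0], [3.0]], [0, 1, 2, 3]
ens4 = adaboost_m1(X4, y4, n_rounds=10)

# Two separable classes: the same weak learner succeeds immediately.
X2, y2 = [[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1]
ens2 = adaboost_m1(X2, y2, n_rounds=10)
```

In the four-class toy problem the weak learner cannot get below a weighted error of 0.5, so the ensemble stays empty. This is the same kind of failure, in miniature, as a base classifier that is too weak for a multiclass problem.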



Table 8. Training Accuracies in Percentage for the AdaBoost.M1 Method Applied to 
the Anderson River Data Set. 

Base Classifier    Class 1  Class 2  Class 3  Class 4  Class 5  Class 6   Average   Overall
                                                                         Accuracy  Accuracy
-----------------  -------  -------  -------  -------  -------  -------  --------  --------
Decision Table        99.5     97.3     99.1     99.3     99.4     99.7      99.0      99.2
j48                  100.0    100.0    100.0    100.0    100.0    100.0     100.0     100.0
Number of Samples      971      551      548      542      317     1260                4189



Table 9. Test Accuracies in Percentage for the AdaBoost.M1 Method Applied to the 
Anderson River Data Set. 

Base Classifier    Class 1  Class 2  Class 3  Class 4  Class 5  Class 6   Average   Overall
                                                                         Accuracy  Accuracy
-----------------  -------  -------  -------  -------  -------  -------  --------  --------
Decision Table        77.3     51.7     73.9     75.0     83.7     85.4      74.5      75.6
j48                   83.0     54.2     81.9     81.4     88.9     88.9      79.7      80.6
Number of Samples     1250      817      701      705      405     1625                5503



4 Conclusions 

In this paper, three multiple classification schemes were examined. All three 
schemes worked well and outperformed several single classifiers in terms of 
accuracies. Therefore, the results presented here demonstrate that multiple 
classification methods can be considered desirable alternatives to conventional 
classification methods when multisource remote sensing data are classified. In particular, 
the AdaBoost.M1 method performed well when a j48 decision tree was used as its 
base classifier, and was the most accurate classifier both in terms of training and 
test accuracies. AdaBoost.M1 did not demonstrate overtraining although 
it achieved 100% training accuracy. The simpler bagging algorithm performed 
better than AdaBoost.M1 in the case of the decision table base classifier, where 
AdaBoost.M1 aborted after only 19 iterations. Bagging does not suffer from 
the restriction of needing at least 50% accuracy and has the further advantage of 
requiring fewer computational resources than the other methods. The LOGP 
consensus theoretic classifier performed well in experiments. Consensus theoretic 
classifiers have the potential of being more accurate than conventional multivariate 
methods in classification of multisource data since a convenient multivariate 
model is not generally available for such data. Also, consensus theory overcomes 
two of the problems with the conventional maximum likelihood method. First, 
using a subset of the data for individual data sources lightens the computational 
burden of a multivariate statistical classifier. Second, a smaller feature 
set helps in providing better statistics for the individual data sources when a 
limited number of training samples is available. 



Abstract. Recently, we have proposed an algorithm for construction of a 
hierarchy of neural network classifiers based on a modification of error 
backpropagation. It combines supervised learning with self-organization. 
Recursive use of the algorithm results in creation of compact and 
computationally effective self-organized structures of neural classifiers. The 
algorithm is applicable for unsupervised analysis of both static objects and 
dynamic objects, described by time series. In the latter case, the algorithm 
performs segmentation of the analyzed time-series into parts characterized by 
different types of dynamics. The algorithm has been successfully tested on 
pseudo-chaotic maps. In this paper the above algorithm is applied to Solar wind 
data analysis. Preliminary results indicate that new structural classes in the 
Solar wind could be distinguished aside from the traditional two- and three-state 
concepts. 



1 Introduction 

A hierarchical approach is often used in complex classification tasks, splitting a 
complex problem into a number of simpler ones. Recently, an algorithm for 
construction of a hierarchy of neural network classifiers (HNNC) was suggested [1]. 
The underlying idea of the algorithm is to use erroneous classifications during neural 
network (NN) training to determine classes that are "similar" in some sense. 
Such "similar" classes (in fact, the classes that cannot be separated by the NN) form a 
cluster of classes, simplifying the classification task at a given level of hierarchy. At 
the next level of hierarchy, another NN may be used for separation of classes assigned 
to this cluster. 

The above algorithm is applicable for classification of static objects in a 
straightforward manner. The same algorithm may be expanded for the analysis of 
dynamic objects, described by time series with switching dynamics. It is assumed that 
the analyzed dynamic object possesses the following features: 

• It has several unknown types of dynamics, while there is no a priori information 
about the types themselves and about the number of such types; 

• At each moment the object is described by only one type of dynamics; 

• Switching between the types of dynamics can occur at arbitrary time moments; 

• The duration of switching is negligible in comparison with the intervals between 
switching. 

The task of time series analysis within the approach developed is to perform 
unsupervised segmentation of the analyzed time-series into parts characterized by 
different types of dynamics. Such analysis may help to determine both the actual 
number of dynamics types and the moments of dynamics switching. In spite of the 
above limiting assumptions, this model may potentially match numerous practical 
problems, say, EEG analysis in medicine, continuous speech segmentation, stock 
market analysis, etc. 

In this paper, the algorithm is applied to the analysis of plasma processes in the 
Earth magnetosphere and in interplanetary space, and the first results of Solar wind 
data analysis using the HNNC algorithm are presented. 



2 Statement of the Problem 

Space physics data include information about hourly averaged magnitudes of the 
interplanetary magnetic field, and about the velocity, density, and temperature of the Solar 
wind, measured at the Earth's orbit. Many original papers, reviews and books 
describe different types and morphological characteristics of the Solar wind (e.g., 
[3]). Historically, the first classification attempts were based on the empirical findings 
of the "quiet" and "perturbed" solar wind states, "fast" and "slow" streams, "hot" and 
"cold" plasma, "low" and "high" density, etc. Later, some correlations appeared more 
clearly between different solar wind parameters and states. Though many authors 
advocate the concept of the solar wind as "a two-state phenomenon", this 
classification is approximately valid only for the "unperturbed" quasistationary 
situations and can be often and strongly violated by the solar activity processes. 
Because of this, in the more recent literature three characteristic types of Solar wind 
are discussed. The "third" class is statistically much less pronounced, and all three 
types often overlap in the statistical sense. 

The solar activity develops with time in a complicated manner that is not 
completely understood and investigated. The corresponding classification of regimes 
and transitions between them is far from being established and belongs to the most 
important and interesting tasks of the current studies. The problem is very 
complicated because of the nonlocal, nonlinear, nonstationary and highly structured 
multi-scale nature of the solar activity driven by different energy, momentum and 
mass transport processes. Dozens of different regimes can be indicated theoretically, 
but their identification in observations is not easy because of the turbulent character of 
the processes in the solar wind. 

From a practical point of view, the most interesting phenomena are the temporal 
dynamics of the Solar wind, and some processes during interaction of the Earth's 
magnetosphere with the Solar wind (global deformation of the magnetosphere, magnetic 
storms, etc.). In this context, Solar wind dynamics analysis by the HNNC algorithm 
may be useful for detection of its characteristic types. 



3 Description of the HNNC Algorithm 

The multi-layer perceptron (MLP) is frequently used for solving classification and 
recognition tasks. In practice, the most critical problems of MLP training are the 
unknown optimal number of hidden neurons and the possibility of getting stuck in a local 
minimum. A typical solution is to repeat training several times. However, MLP training 
is computationally expensive. Taking into account that the hidden layer size required for 
solving complex tasks may be quite large, even a single re-training of an MLP may 
become unacceptably long. 

The algorithm presented here aims to solve (or at least alleviate) the above 
problems. It is based on a modification of error backpropagation training of the MLP. 

Let us consider the process of MLP training as simultaneous feature extraction in 
the hidden layer and decoding of these features in the output layer. For the simplest 
MLP architecture, the number of extracted features may be considered to be 
proportional to the number of neurons in the hidden layer. Then, a small number of 
neurons in the hidden layer may lead to a situation in which correct recognition of all 
initial classes is impossible, and some classes are considered "similar". Thus, 
the MLP architecture may be used to join classes into groups on the basis of some 
features. 

The suggested algorithm for MLP training consists of three stages. 

At the first stage, the desired output for all patterns from j-th class consists of 1 for 
j-th output neuron and of 0 for all other neurons, and training is performed using usual 
error back-propagation method. 

The second stage is invoked every T training epochs. At this moment, the 
algorithm analyzes the statistics of the MLP's answers on the training set. One 
possible method of analysis is based on pattern "voting", and it is implemented as 
follows. Let us denote the number of classes as C, the i-th class as C_i, the amplitude of the k-th 
output neuron as Y_k, and the voting threshold as V. A pattern "votes" for belonging to 
class C_j if j = argmax(Y_i, i=1,...,C) and if Y_j > V. The value of V is set either a priori, or 
using some objective indicator of network answer confidence. 

Then a simple majority vote is taken within each class. If the number of patterns from class C_i 
that voted for belonging to C_j is greater than half of the number of patterns in C_i, then all 
representatives of C_i are considered as belonging to C_j. In fact, this procedure results in 
formation of groups of classes that are not separable by a given NN. 
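
The voting stage can be sketched as follows. This is one reading of the procedure, since the index notation in the available text is damaged: each pattern votes for the class of its strongest output neuron if that output exceeds the threshold V, and a class is merged into another class when more than half of its patterns voted for that class. The function name, the toy outputs, and the labels are assumptions made for illustration.

```python
def merge_classes_by_voting(outputs, labels, threshold):
    """Return a dict mapping each current class to the class it is merged into.

    outputs[p][k] is the amplitude of output neuron k for pattern p,
    labels[p] is the current class index of pattern p.
    """
    n_classes = len(outputs[0])
    votes = {c: [0] * n_classes for c in set(labels)}
    sizes = {c: 0 for c in set(labels)}
    for y, label in zip(outputs, labels):
        sizes[label] += 1
        winner = max(range(n_classes), key=lambda k: y[k])
        if y[winner] > threshold:      # weak answers do not vote
            votes[label][winner] += 1
    mapping = {}
    for c, tally in votes.items():
        target = max(range(n_classes), key=lambda k: tally[k])
        # Simple majority: more than half of the class must agree on the target.
        mapping[c] = target if tally[target] > sizes[c] / 2 else c
    return mapping

# Toy network answers for three classes (assumed values).
outputs = [
    [0.90, 0.10, 0.00], [0.80, 0.20, 0.10],                      # class 0
    [0.70, 0.30, 0.00], [0.60, 0.50, 0.10], [0.20, 0.60, 0.10],  # class 1
    [0.10, 0.00, 0.90], [0.15, 0.10, 0.10],                      # class 2
]
labels = [0, 0, 1, 1, 1, 2, 2]
mapping = merge_classes_by_voting(outputs, labels, threshold=0.2)
```

In this toy case two of the three patterns of class 1 vote for class 0, so class 1 is joined to class 0, while class 2 keeps its own label because only one of its two patterns casts a vote. The desired outputs for the next training stage would then be rebuilt from `mapping`.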

At the third stage, the desired output for each class is modified according to the 
voting results, and the training proceeds. 

The stages training - voting - modification are repeated until the classes cease to 
join and until the recognition error reaches acceptable level. This procedure results in 
clustering the input data and, at the same time, in formation of the classifier that 
supports exactly that clustering. 

The above feature is critical for the HNNC algorithm. The structure of an HNNC is 
not chosen a priori. Its construction starts from the base node. Each node is 
implemented as an MLP and is trained using the above procedure. All initial classes 
joined into one group form a separate branch of the tree. After creation of the base 
node, the same procedure is used for each branch. Within a node, each of the initial 
classes assigned to the corresponding branch of the tree is again considered as a 
separate class. The process is repeated until each branch contains one initial class. 

By varying algorithm parameters (MLP hidden layer size H, learning rate r, 
momentum m, period T of analysis of answers statistics, and voting threshold V), it is 
possible to change the resulting number of classes in each node, thus controlling 
topology of the hierarchical tree constructed. 

A shortcoming of the algorithm is the dependence of the resulting structure on 
the initialization of MLP weights, as the discriminative features extracted during the process 
of HNNC construction depend on the starting point. Nevertheless, such dependence does 
not necessarily result in poor performance of the HNNC constructed. 

The algorithm performance has been tested on different real-world problems of 
classification of static patterns (printed letters, textures, spectrograms of isolated 
words and vowels), and the algorithm has demonstrated high efficiency [2]. For a 
well-known benchmark task of speaker-independent recognition of 11 steady state 
English vowels [4], the best recognition rate of the HNNC (H=1, r=0.01, m=0.9, T=1, 
V=0.2) on the test set was 58%. At the same time, the best recognition rate for a 
single MLP (H=60, r=0.01, m=0.9) was about 52%. The total number of weights in the 
HNNC mentioned above was equivalent to that of an MLP with 7 hidden neurons. The 
same HNNC applied to the test set corrupted by 20% white noise outperformed all the 
MLPs tested, degrading only slightly to a recognition rate of 54%. 



4 Time Series Analysis Using HNNC Algorithm 

The underlying idea of using the algorithm for time-series analysis is the following 
[2]. The time-series is divided into segments of equal length, and dynamics describing 
each such segment is considered to be fixed. Under the assumption that switching 
between different types of dynamics is instant and rather rare, each segment is at first 
considered belonging to a separate class with its own dynamics. 

Next, the HNNC algorithm is applied to the analyzed time-series. As the result, any 
segment of a time series may be reassigned to another class. Due to the ability of the 
algorithm to join similar classes, segments with similar or the same types of dynamics 
are attributed to the same class. This assignment is done without any a priori 
information. 

In fact, the task of time series analysis is not a classification task, but an 
unsupervised segmentation. In this formulation, the process of HNNC creation 
continues until no further separation of classes is possible. A group of classes is 
considered non-separable if none of these classes can be separated from the other classes 
in the group with a recognition rate on the training set better than 75%. 

Some papers (e.g., [5]) describe neural network approaches to the analysis of time 
series with switching dynamics. However, in these approaches the number of neural 
networks is set in advance, so the whole structure of networks is non-adaptive. Also, 
the networks work in parallel instead of hierarchical organization. 

Recently, the HNNC algorithm was tested on the model task of unsupervised 
segmentation of time series constructed using pseudo-chaotic maps [6]. Four well-known 
pseudo-chaotic sequences were used for time series generation: 

logistic map: f(x) = 4x(1-x) for x ∈ [0, 1]; 

tent map: f(x) = 2x for x ∈ [0, 0.5), and f(x) = 2(1-x) for x ∈ [0.5, 1]; 

double logistic map, and double tent map (the latter two are produced by recursion of 
the logistic map and tent map, respectively). 

These sequences alternated, producing from 25 to 100 points each, while the total 
size of a training set ranged from 1000 to 2000 points, depending on the statement of 
the experiment. The algorithm was tested in different conditions, in particular with 
unbalanced classes, when some chaotic maps produced many more points than other 
maps (1:27 in the worst case). An input pattern was formed of 5 sequential points from a 
time series, with the rightmost point of the window determining what class the pattern 
belonged to. Parameter settings were H=20-40; r=0.01-0.5; m=0.9; T=10; V=0.2. 
Usually, larger values of H and r were used at the first level of hierarchy. A typical 
HNNC consisted of 2-3 levels of hierarchy, separating segments of different 
dynamics with a recognition rate of 96-99% on the training set, and about 85% on the test set. For 
comparison, an MLP with 4 classes (H=25, r=0.01, m=0.9) was trained using a priori 
information about the model data (which is not available in practice). Its recognition rate was 
about 95% on the training set, and about 88% on the test set. The results have shown that the 
HNNC algorithm is promising for time series segmentation and analysis. 
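
The construction of the model data can be sketched as follows. Several details are assumptions made for illustration: the "double" maps are taken to be the map applied twice, the state is carried over between segments, and only the logistic and double logistic maps are used for the generated series, because the tent map with slope 2 degenerates to zero after a few dozen iterations in binary floating-point arithmetic. The segment lengths and the starting value are hypothetical.

```python
def logistic(x):
    return 4.0 * x * (1.0 - x)

def tent(x):
    return 2.0 * x if x < 0.5 else 2.0 * (1.0 - x)

def double_logistic(x):
    # Assumed reading of "produced by recursion": the map applied twice.
    return logistic(logistic(x))

def double_tent(x):
    return tent(tent(x))

def switching_series(maps, lengths, x0):
    """Alternate between maps; segment k uses maps[k % len(maps)] for lengths[k] points."""
    series, labels, x = [], [], x0
    for k, n_points in enumerate(lengths):
        which = k % len(maps)
        for _ in range(n_points):
            x = maps[which](x)
            series.append(x)
            labels.append(which)
    return series, labels

def make_patterns(series, labels, width=5):
    """Sliding window of `width` points; the rightmost point sets the class."""
    return [(series[i - width + 1:i + 1], labels[i])
            for i in range(width - 1, len(series))]

series, labels = switching_series([logistic, double_logistic],
                                  lengths=[30, 25, 40], x0=0.3)
patterns = make_patterns(series, labels)
```

In the unsupervised setting the true map labels would not be available; each segment would instead start as its own class, and the labels here serve only to check the resulting segmentation.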



5 Solar Wind Data Analysis 

5.1 Data Preparation 

Multi-factor statistical analysis of Solar activity dynamics and of Solar wind 
parameters presented by hourly averaged time series, was done recently [7]. 
According to this analysis, the characteristic time scale of dynamics that is going to be 
discovered by the developed algorithm, is equal to 27 days. 

The time series data analyzed by the proposed algorithm consisted of hourly averaged 
magnitudes of Solar wind velocity during the period of March 1974-March 1975 
(8760 points), taken from [8]. Due to substantial gaps in the data (about 20% of points), a 
special gap-filling procedure was used. First, the gaps were filled linearly with 
some additive noise, and log10 of the resulting time series was calculated. Second, 
the coefficients of the Daubechies-4 wavelet transformation [9] over 512 points (i.e. 512 
hours, constituting about 3 weeks) were found, and the inverse wavelet transformation 
was done with rejection of coefficients lower than 30% of the maximal one. The filtered curve 
was smoothed by averaging over 6 hours, and a moving average over 128 hours was 
subtracted. The resulting time series is presented in Fig. 1. 

In order to obtain compact representation of the analyzed time series dynamics, the 
first 4 coefficients of wavelet transformation Daubechies-4 were calculated over 128 
points (128 hours, constituting about 5 days). These coefficients along with their 13 
lags (taken with step 48 hours) formed 56 inputs for NN. Maximal lag was 624 hours, 
and the size of time window during analysis was about 27 days. 
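
The input dimensionality follows from simple arithmetic: 4 coefficients at the current position plus 13 lagged copies give 4 × 14 = 56 inputs, and the largest lag is 13 × 48 = 624 hours. A minimal sketch of assembling such an input vector is given below. The function `coeffs_at` is a hypothetical callback that returns the 4 wavelet coefficients for the 128-hour window ending at a given hour; the placeholder used in the example does not compute a real wavelet transformation.

```python
N_COEFFS = 4      # wavelet coefficients kept per window
N_LAGS = 13       # number of lagged copies
LAG_STEP_H = 48   # lag step in hours

# Lags in hours: 0, 48, ..., 624
LAGS_H = [k * LAG_STEP_H for k in range(N_LAGS + 1)]

def build_input(coeffs_at, t):
    """Concatenate the coefficients at hour t and at each lag before t."""
    vector = []
    for lag in LAGS_H:
        vector.extend(coeffs_at(t - lag))
    return vector

def placeholder_coeffs(hour):
    """Stand-in for the wavelet step (assumption): returns 4 dummy values."""
    return [float(hour), 0.0, 0.0, 0.0]

x = build_input(placeholder_coeffs, 1000)
```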

Solar wind data (like many other practical tasks) have no expert segmentation, so no 
test set is available. Nevertheless, the recognition rate on the training set still describes the 
segmentation quality. Let us denote the i-th segment of a time series as S_i and assume 
that this segment was assigned to class C_j. In a sense, classification of patterns from S_i 
into C_j may be considered a "correct" one, and the recognition rate is the number 
of patterns from S_i that were classified into C_j, divided by the total number of patterns in 
S_i. The results presented below were calculated on the training set. For the same reason (no 
expert segmentation) no comparison with a single MLP was done. 

Fig. 1. Solar wind velocity, logarithmic scale (8760 hourly averaged points, March 74 - March 
75). Distance between vertical lines is 648 hours (27 days, one period of Sun rotation). 
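
The recognition rate defined above can be computed with a short sketch. The names and the toy values are assumptions; in particular, pooling all patterns to obtain a single overall rate is an assumed convention, since the text defines the rate only per segment.

```python
def recognition_rates(segment_ids, predicted, assigned_class):
    """Per-segment and pooled recognition rates on the training set.

    segment_ids[p]  : index of the segment that pattern p comes from
    predicted[p]    : class the classifier put pattern p into
    assigned_class  : dict mapping each segment to the class it was assigned to
    """
    counts = {}
    for seg, pred in zip(segment_ids, predicted):
        hits, total = counts.get(seg, (0, 0))
        counts[seg] = (hits + (pred == assigned_class[seg]), total + 1)
    per_segment = {seg: hits / total for seg, (hits, total) in counts.items()}
    overall = sum(hits for hits, _ in counts.values()) / len(segment_ids)
    return per_segment, overall

# Toy example: two segments of four patterns each (assumed values).
segment_ids = [0, 0, 0, 0, 1, 1, 1, 1]
assigned_class = {0: "A", 1: "B"}
predicted = ["A", "A", "A", "B", "B", "B", "A", "B"]
per_segment, overall = recognition_rates(segment_ids, predicted, assigned_class)
```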



5.2 Results 

In the first experiment, the time series was divided into 14 segments 648 points each 
(segment length corresponded to 27 days). Each segment was associated with one of 
the initial classes. Thus, the base network had 56 inputs and 14 outputs. 

The HNNC constructed in this experiment is shown in Fig. 2. The algorithm 
segmented the analyzed time series into 7 groups (7 terminal nodes in Fig. 2). Fig.3 
presents the resulting segmentation of the time series. It shows the number of group 
each point was assigned to vs time, with groups numbered as follows: group #1 - 
segments 2, 6; group #2 - segments 8, 12; group #3 - segments 3, 7; group #4 - 
segments 1, 11; group #5 - segments 5, 10; group #6 - segments 4, 13, 14; group #7 - 
segment 9. 

The unstable segmentation in group #3 (segments 3 and 7, see Fig. 3) may apparently 
be explained by the fact that these segments correspond to transition processes in 
the studied dynamics (recall that one of the basic assumptions of the algorithm is 
instant switching from one type of dynamics to another). 

Possible solutions for this problem may be, first, expansion of the algorithm for the 
case of smooth drift between dynamics types, second, taking alternative information 
into account (e.g. Solar wind density and temperature), and third, decrease of segment 
length. In order to verify the latter supposition, further experiments were performed 
with the time series divided into 27 segments 324 points each. Thus, the base network 
had 56 inputs and 27 outputs. 

The HNNC obtained in one of the experiments is shown in Fig. 4. All nodes of this 
HNNC were built using the parameters H=2, m=0.9, T=10, V=0.2. The value of 
learning rate was r=0.5 for the base node, and r=0.1 for all other nodes. 

The algorithm segmented the analyzed time series into 5 groups. Groups were 
numbered as follows: group #1 - segments 1, 3, 11, 15, 21, 23; group #2 - segments 2, 
10, 20, 22; group #3 - segments 4, 6, 7, 8, 12, 13, 16, 17, 24, 25, 27; group #4 - 
segments 5, 9, 14, 19, 26; group #5 - segment 18. The bottom chart in Fig. 5 presents the 
resulting segmentation of the time series. It is clearly seen that the segmentation is 
substantially more stable than in Fig. 3, confirming our hypothesis that the segment 
length was too large in the first experiment. 

Fig. 2. HNNC constructed for partitioning of the time series into 14 segments of 648 points each. 
Each node of the HNNC contains the list of numbers of segments assigned to this node. Each 
non-terminal node also contains the parameters used during its training (T=10, V=0.2 were 
used for all nodes), and the recognition rate, which was calculated only for the segments assigned 
to this node. Overall recognition rate was 83.66%. 

Fig. 3. The time series segmentation obtained with the HNNC presented in Fig. 2. Vertical lines 
denote segment boundaries. 

The top chart in Fig. 5 gives another view of the same segmentation. The number of
patterns assigned by the HNNC to each of the 5 groups was calculated within a moving
window 100 points wide. The resulting 5 curves (the amplitude of each curve ranges
from 0 to 100) may be treated as the confidence of the HNNC in its decision regarding
the type of the time series dynamics at a given time. In the case of 100% confidence,
all patterns within a window should be assigned to a single group. One can see rather
confident answers of the HNNC in most segments. However, the HNNC confidence
within segments 5, 6, and 13 decreases substantially. It is interesting to note that
these segments correspond to the ones where unstable segmentation was obtained in the
previous experiment.

Fig. 4. HNNC constructed for partitioning of the time series into 27 segments of 324
points each. Overall recognition rate was 90.96%. Notations are the same as in Fig. 2.

Fig. 5. Time series segmentation obtained with the HNNC presented in Fig. 4 (bottom).
Number of patterns assigned to the groups formed at the last level of the hierarchy
versus time (top; curves a-e correspond to groups 1-5, respectively). Vertical lines
denote segment boundaries.
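
The moving-window confidence described in the text can be illustrated with a short
sketch. It is not the code used in the experiments: the function name
`window_counts` and the toy label sequence are assumptions made for this example;
only the window width of 100 points and the 5 groups are taken from the text.

```python
from collections import Counter

def window_counts(labels, n_groups, width=100):
    """For every full window position, count how many points inside the
    window were assigned to each group (groups are numbered 1..n_groups)."""
    counts = Counter(labels[:width])
    curves = [[counts.get(g, 0) for g in range(1, n_groups + 1)]]
    for t in range(width, len(labels)):
        counts[labels[t]] += 1          # point entering the window
        counts[labels[t - width]] -= 1  # point leaving the window
        curves.append([counts.get(g, 0) for g in range(1, n_groups + 1)])
    return curves

# Toy sequence (hypothetical): 150 points of group 1, then 150 points of group 3.
labels = [1] * 150 + [3] * 150
curves = window_counts(labels, n_groups=5, width=100)
print(curves[0])    # window lying completely inside group 1
print(curves[100])  # window straddling the switch between the groups
```

A window that lies entirely inside one type of dynamics gives a count of 100 for a
single group, while a window covering a switch splits its counts between groups,
which is the loss of confidence visible in the top chart of Fig. 5.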

The HNNC constructed in this experiment is more balanced and more symmetric
than the HNNC in Fig. 2, and the overall recognition rate is higher (90.96%).

The stability of the segmentation results was also investigated. The HNNC algorithm
was applied to the same time series and with the same parameters as for the HNNC in
Fig. 4, but with a different weight initialization. The overall recognition rate of
the constructed HNNC was 92.14%, and the segmentation obtained was very similar to
the one in Fig. 5. This suggests that the algorithm produces rather stable results,
and that the segmentation obtained reflects some inherent features of the analyzed
time series.

Analysis of typical waveforms in each of the groups obtained (cf. Fig. 1 and Fig. 5)
gives rise to the hypothesis that groups #1 and #2 correspond to the periods of
"quiet" Solar wind (these periods are characterized by evident transitions from
maximal magnitudes to minimal ones), while groups #3 and #4 correspond to the periods
of "perturbed" Solar wind (these periods include numerous peaks of intermediate
magnitudes). Group #5 (segment 18) corresponds to an atypical situation, in which
the Solar wind velocity changed within a narrow range of magnitudes.

Obviously, we may treat each group number as the number of the current state of the
dynamic system. Let us denote sequences of states by the numbers of the corresponding
groups, e.g. «1-3» denotes a transition from state #1 to state #3. Then several
sequences of states of different lengths may be detected in the segmentation (Fig. 5,
bottom chart), repeating with a step of half a year (one of the well-known Solar
cycles). The length of such sequences ranges from 2 segments (sequence «1-3»,
segments 3-4 and 15-16), to 3 segments (sequence «4-2-1», segments 9-11 and 19-21),
and even 5 segments (sequence «2-1-3-3-4», segments 10-14 and 22-26). Our hypothesis
is that the HNNC algorithm not only makes it possible to detect different types of
dynamics, but may also help to reveal rather long sequences of transitions from one
type of dynamics to another.
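
The search for repeating state sequences can be sketched in a few lines. The
segment-to-group assignment below is the one listed for the 27-segment experiment;
the function `repeated_runs` and the choice of lags (10 and 12 segments, matching
the distances between the repetitions quoted above) are assumptions made for
illustration and are not part of the HNNC algorithm itself.

```python
def repeated_runs(labels, lag, min_len=2):
    """Return (first_segment, states) for every maximal run of segments whose
    states recur unchanged `lag` segments later (segments are numbered from 1)."""
    runs, i, n = [], 0, len(labels)
    while i + lag < n:
        j = i
        while j + lag < n and labels[j] == labels[j + lag]:
            j += 1
        if j - i >= min_len:
            runs.append((i + 1, labels[i:j]))
        i = max(j, i + 1)
    return runs

# Group membership of the 27 segments, as listed for the second experiment.
groups = {1: [1, 3, 11, 15, 21, 23], 2: [2, 10, 20, 22],
          3: [4, 6, 7, 8, 12, 13, 16, 17, 24, 25, 27],
          4: [5, 9, 14, 19, 26], 5: [18]}
labels = [0] * 27
for state, segments in groups.items():
    for seg in segments:
        labels[seg - 1] = state

for lag in (10, 12):
    print(lag, repeated_runs(labels, lag))
```

With segments of 324 hourly points (13.5 days), a lag of 12 segments corresponds to
162 days, i.e. roughly half a year.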

On the basis of earlier investigations [7], we may suggest some interpretation of 
the transitions between groups recognized by the HNNC (Fig. 5). There could be 
additional structural elements and classes in Solar wind morphology aside from the 
simplified two- and three-state concept. The hypothesis is that the groups found 
reflect global asymmetry of the Solar wind emerging from its different sources in the 
corona. Namely, we suppose that the solar wind from northern and from southern 
coronal holes had different parameters during the period of time in 1974-1975. To 
verify this hypothesis, the polarity and geometry of the heliospheric magnetic field 
should be included in further analysis. 



6 Conclusion 

This paper presents the preliminary results of Solar Wind data analysis using 
hierarchical neural network classifiers (HNNC). The HNNC algorithm is based on a 
modification of error backpropagation combining supervised learning with self- 
organization. Recursive use of the algorithm builds a hierarchical tree during the 
process of training, resulting in creation of a self-organized structure of neural 
classifiers. The HNNC algorithm was expanded for the analysis of dynamic objects, 
described by time series with switching dynamics. The task of time series analysis 
within the approach developed is to perform unsupervised segmentation of the 
analyzed time-series into parts characterized by different types of dynamics. 

In this paper, the algorithm is applied to the analysis of the hourly averaged
velocity of the Solar wind. We suppose that the segmentation obtained reflects some
inherent features of the analyzed time series, distinguishing periods of "perturbed"
and "quiet" Solar wind. The algorithm makes it possible to detect different types of
dynamics and may help to reveal rather long sequences of transitions between types of
dynamics. Preliminary results indicate that new structural classes in the Solar wind
could be distinguished aside from the traditional two- and three-state concepts.

Future development of the algorithm includes its expansion for the analysis of a 
more complex case of smooth drift between dynamics types. 

This work was supported in part by RFBR (Russian Foundation for Basic Research) 
(grant number 01-01-00925).



Abstract. In the problem of one-class classification, target objects should be
distinguished from outlier objects. In this problem it is assumed that only
information about the target class is available, while nothing is known about the
outlier class. Like standard two-class classifiers, one-class classifiers hardly
ever fit the data distribution perfectly. Using only the best classifier and
discarding the classifiers with poorer performance might waste valuable information.
To improve performance, the results of different classifiers (which may differ in
complexity or training algorithm) can be combined. This can increase not only the
performance but also the robustness of the classification. Because for one-class
classifiers only information about one of the classes is present, combining
one-class classifiers is more difficult. In this paper we investigate if and how
one-class classifiers can best be combined in a handwritten digit recognition
problem.



1 Introduction 

The goal of Data Description (or One-Class Classification) is to distinguish
between a set of target objects and all other possible objects (by definition
considered outlier objects). It is mainly used to detect new objects that resemble
a known set of objects. When a new object does not resemble the data, it is likely
to be an outlier or a novelty. When it is accepted by the data description, it can
be used with higher confidence in a subsequent classification.

Different methods have been developed to make a data description. In most
cases the probability density of the target set is modeled. This requires
a large number of samples to overcome the curse of dimensionality. Techniques
other than estimating a probability density exist. It is possible to use the
distance to a model, or just to estimate the boundary around the class without
estimating a probability density. A neural network can be restricted to form a
closed decision surface, various forms of vector quantization are possible, and
recently a method based on the Support Vector Classifier, the Support Vector Data
Description, was proposed.

As in normal classification problems, one classifier hardly ever captures
all characteristics of the data. Combining classifiers can therefore be considered.
Commonly a combined decision is obtained by just averaging the estimated posterior
probabilities. This simple algorithm already gives very good results. This is
somewhat surprising, especially considering the fact that averaging of the posterior
probabilities is not based on some solid (Bayesian) foundation. When Bayes' theorem
is adopted for the combination of different classifiers, a product combination rule
automatically appears under the assumption of independence: the outputs of the
individual classifiers are multiplied and then normalized (this is also called a
logarithmic opinion pool).

J. Kittler and F. Roli (Eds.): MCS 2001, LNCS 2096, pp. 299-308, 2001.
© Springer-Verlag Berlin Heidelberg 2001

D.M.J. Tax and R.P.W. Duin

One-class classifiers cannot provide posterior probabilities for target objects,
because information on the outlier data is not available. When a uniform
distribution over the feature space is assumed, the posterior probability can be
estimated once the target class probability is found. When a one-class classifier
does not estimate a density, its output should be mapped to a probability before it
can be combined with other classifiers. In this paper we investigate the influence
of the feature sets (whether they are dependent or not) and of the type of one-class
classifiers on the best choice of the combination rule.



2 Theory 



We assume that we have data objects $\mathbf{x}_i$, $i = 1, \ldots, N$, which are
represented in several feature spaces $\mathcal{X}_k$, $k = 1, \ldots, R$ (the
representation of an object in $\mathcal{X}_k$ is written $\mathbf{x}^k$). Each
object can be a target object, labeled $\omega_T$, or an outlier object, labeled
$\omega_O$ (although during the training of one-class classifiers we assume example
outlier objects are not available). In each feature space different one-class
classifiers are trained. In the literature a theoretical framework for combining
(estimated posterior probabilities from) normal classifiers has been developed.
For different types of combination rules derivations are obtained. When classifiers
are applied on (almost) identical data representations
$\mathcal{X}_1 = \mathcal{X}_2 = \cdots = \mathcal{X}_R$, the classifiers estimate
the same class posterior probability $p(\omega_j|\mathbf{x}^k)$, potentially
suffering from the same noise in the data. To suppress the errors in these estimates
and the overfitting by the individual classifiers, the classifier outputs may be
averaged. This results in the mean combination rule:

$$ f_j(\mathbf{x}^1, \ldots, \mathbf{x}^R) = \frac{1}{R} \sum_{k=1}^{R} f_j^k(\mathbf{x}^k) \tag{1} $$

where $j$ indexes the target and outlier class.

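On the other hand, when independent data representations $\mathcal{X}_k$ are
available, classifier outcomes should be multiplied to gain maximally from the
independent representations. This results in the product combination rule:

$$ f_j(\mathbf{x}^1, \ldots, \mathbf{x}^R) = \frac{\prod_{k=1}^{R} f_j^k(\mathbf{x}^k)}{\sum_{j'} \prod_{k=1}^{R} f_{j'}^k(\mathbf{x}^k)} \tag{2} $$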
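
The two rules can be compared on a small numerical example. The sketch below is an
illustration only; the function names and the three hypothetical classifier outputs
are assumptions, and the outputs are taken to be estimated posteriors that sum to
one over the two classes.

```python
from math import prod

def mean_rule(estimates):
    """Mean combination: estimates[k][j] is the output of classifier k for class j."""
    R = len(estimates)
    n_classes = len(estimates[0])
    return [sum(e[j] for e in estimates) / R for j in range(n_classes)]

def product_rule(estimates):
    """Product combination: multiply per class, then normalise over the classes."""
    n_classes = len(estimates[0])
    products = [prod(e[j] for e in estimates) for j in range(n_classes)]
    total = sum(products)  # assumed non-zero in this sketch
    return [p / total for p in products]

# Hypothetical outputs of R = 3 classifiers for (target, outlier).
estimates = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4]]
combined_mean = mean_rule(estimates)
combined_product = product_rule(estimates)
print(combined_mean)
print(combined_product)
```

When all classifiers lean towards the same class, the product rule yields a more
extreme combined estimate than the mean rule, because the per-classifier evidence is
treated as independent.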



2.1 One-Class Classifiers 

One-class classifiers are trained to accept data from the target class and to
reject outlier data. We can distinguish two types of one-class classifiers. The first
type are the density estimators, which just estimate the target class probability
density $p(\mathbf{x}|\omega_T)$. In this paper we use a normal density, a mixture of
Gaussians and the Parzen density estimation.

The second type of methods fit a model to the data and compute the distance
$\rho(\mathbf{x}|\omega_T)$ to this model. Here we will use four simple models: the
support vector data description (SVDD), k-means clustering, the k-center method and
an auto-encoder neural network. In each case a descriptive model is fitted to the
data and the resemblance (or distance) to this model is used. In the SVDD a
hypersphere is put around the data. By applying the kernel trick (analogous to the
support vector classifier) the model becomes more flexible to follow the
characteristics in the data. Instead of the target density the distance to the
center of the hypersphere is used. In the k-means and k-center methods the data is
clustered, and the distance to the nearest prototype is used. Finally, in the
auto-encoder network the network is trained to represent the input pattern at the
output layer. The network contains one bottleneck layer to force it to learn a
(nonlinear) subspace through the data. The reconstruction error of the object in the
output layer is used as distance to the model.
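
As an illustration of a distance-based data description, the following sketch fits
prototypes with plain k-means and uses the squared distance to the nearest prototype
as $\rho(\mathbf{x}|\omega_T)$. It is a minimal sketch under stated assumptions: the
initialisation (first k objects), the number of iterations and the toy target set
are hypothetical and are not the settings used in the experiments.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(data, k, n_iter=20):
    """Plain Lloyd iterations; the first k objects serve as initial prototypes."""
    prototypes = [list(p) for p in data[:k]]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda c: sq_dist(x, prototypes[c]))
            clusters[nearest].append(x)
        for c, members in enumerate(clusters):
            if members:  # keep the old prototype if a cluster runs empty
                prototypes[c] = [sum(col) / len(members) for col in zip(*members)]
    return prototypes

def distance_to_model(x, prototypes):
    """Distance of an object to the description: nearest-prototype distance."""
    return min(sq_dist(x, p) for p in prototypes)

# Hypothetical two-dimensional target set with two clumps.
targets = [(0.0, 0.0), (4.0, 4.0), (0.2, 0.1), (4.1, 3.9), (0.1, -0.1), (3.9, 4.2)]
prototypes = kmeans(targets, k=2)
print(distance_to_model((0.1, 0.0), prototypes))    # close to a prototype
print(distance_to_model((10.0, 10.0), prototypes))  # far from both prototypes
```

An object is accepted when its distance stays below a threshold; how that threshold
is chosen is a separate step.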

2.2 Posterior Probabilities for One-Class Classifiers 

To make an accept/reject decision in all the one-class methods, a threshold
should be set on the estimated probability or distance. A principled way for
setting this threshold is to supply the fraction $f_T$ of the target set which
should be accepted. This defines the threshold $\theta_{f_T}$:

$$ \frac{1}{N} \sum_{i=1}^{N} I\left( p(\mathbf{x}_i|\omega_T) \geq \theta_{f_T} \right) = f_T \tag{3} $$

where $I(\cdot)$ is the indicator function. In this paper it is assumed that for all
methods the threshold is put such that a fraction $f_T$ of the target data is
accepted ($f_T = 0.9$).

When one-class classifiers are to be combined based on posterior probabilities,
Bayes rule should be used to compute $p(\omega_T|\mathbf{x})$ from
$p(\mathbf{x}|\omega_T)$:

$$ p(\omega_T|\mathbf{x}) = \frac{p(\mathbf{x}|\omega_T)\,p(\omega_T)}{p(\mathbf{x})} = \frac{p(\mathbf{x}|\omega_T)\,p(\omega_T)}{p(\mathbf{x}|\omega_T)\,p(\omega_T) + p(\mathbf{x}|\omega_O)\,p(\omega_O)} \tag{4} $$

Because the outlier distribution $p(\mathbf{x}|\omega_O)$ is unknown, and even the
prior probabilities $p(\omega_T)$ and $p(\omega_O)$ are very hard to estimate,
equation (4) cannot be used directly. The problem is solved when an outlier
distribution is assumed. When $p(\mathbf{x}|\omega_O)$ is independent of
$\mathbf{x}$, i.e. it is a uniform distribution in the area of the feature space
that we are considering, $p(\mathbf{x}|\omega_T)$ can be used instead of
$p(\omega_T|\mathbf{x})$.

Regardless of whether a one-class classifier estimates a density or a
reconstruction error (distance), for all types the chances of accepting and
rejecting a target object, $p(\mathrm{acc}\ \mathbf{x}|\omega_T)$ and
$p(\mathrm{rej}\ \mathbf{x}|\omega_T)$, are available. Then $p(\omega_T|\mathbf{x})$
is approximated by just two values, $f_T$ and $1 - f_T$. The binary outputs of the
one-class methods can be replaced by these probabilities. Using just the binary
output (accept or reject) the different one-class methods can only be combined by
majority voting.

When the more advanced combining rules are required (equations (1) or (2)),
$p(\mathbf{x}|\omega_T)$ should be available and a distance
$\rho(\mathbf{x}|\omega_T)$ should be transformed to a resemblance. Therefore some
heuristic mapping has to be applied. One possible transformation is:

$$ P(\mathbf{x}|\omega_T) = \exp\left(-\rho(\mathbf{x}|\omega_T)/s\right) \tag{5} $$

(which models a Gaussian distribution around the model if $\rho(\mathbf{x}|\omega_T)$
is a squared distance). The scale parameter $s$ can be fitted to the distribution of
$\rho(\mathbf{x}|\omega_T)$. Furthermore it has the advantage that the probability is
always bounded between 0 and 1.
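
Setting the threshold from the accepted target fraction and mapping a distance to a
resemblance can be sketched as follows. This is an illustration under assumptions:
the helper names are hypothetical, the scale $s$ is simply set to the mean training
distance (the text only says that it can be fitted to the distance distribution),
and the ten training distances are made up.

```python
import math

def threshold_for_fraction(scores, f_T):
    """Largest threshold such that at least a fraction f_T of the training
    target scores satisfies score >= threshold (higher = more target-like)."""
    ordered = sorted(scores, reverse=True)
    n_accept = max(1, math.ceil(f_T * len(ordered) - 1e-9))  # guard rounding noise
    return ordered[n_accept - 1]

def distance_to_resemblance(rho, s):
    """Heuristic mapping exp(-rho / s); the result lies in (0, 1] for rho >= 0."""
    return math.exp(-rho / s)

# Hypothetical distances of ten training target objects to a fitted model.
distances = [0.1 * i for i in range(1, 11)]
s = sum(distances) / len(distances)  # assumption: scale = mean training distance
resemblances = [distance_to_resemblance(r, s) for r in distances]

theta = threshold_for_fraction(resemblances, f_T=0.9)
accepted = sum(r >= theta for r in resemblances)
print(theta, accepted)
```

Because the mapping is monotonically decreasing in the distance, thresholding the
resemblance at the mapped value accepts exactly the objects that a threshold on the
original distance would accept.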



2.3 OC Combining Rules 

Given a set of $R$ posterior probability estimates, the following set of combining
rules can be defined.

First the mean vote, which combines the binary (0-1) output labels:

$$ y_{mv}(\mathbf{x}) = \frac{1}{R} \sum_{k=1}^{R} I\left( P_k(\mathbf{x}|\omega_T) \geq \theta_k \right) \tag{6} $$

Here $\theta_k$ is the threshold for method $k$. When the heuristic method for
computing a probability $P_k(\mathbf{x}|\omega_T)$ from a distance
$\rho(\mathbf{x}|\omega_T)$ has to be used (equation (5)), the original threshold
for the method should also be mapped. For a threshold of 0.5 this rule becomes a
majority vote in a two-class problem.

The second combining rule is the mean weighted vote, where the weighting
by $f_{T,k}$ and $1 - f_{T,k}$ is introduced. Here $f_{T,k}$ is the fraction of the
target class that is accepted by method $k$.

$$ y_{mwv}(\mathbf{x}) = \frac{1}{R} \sum_{k=1}^{R} \left( f_{T,k}\, I\left( P_k(\mathbf{x}|\omega_T) \geq \theta_k \right) + (1 - f_{T,k})\, I\left( P_k(\mathbf{x}|\omega_T) < \theta_k \right) \right) \tag{7} $$

This is a smoothed version of the previous rule, but it gives identical results
when a threshold of 0.5 is applied.

The third is the product of the weighted votes:

$$ y_{pwv}(\mathbf{x}) = \frac{1}{Z} \prod_{k=1}^{R} f_{T,k}\, I\left( P_k(\mathbf{x}|\omega_T) \geq \theta_k \right) \tag{8} $$

with $Z = \prod_{k=1}^{R} f_{T,k}\, I\left( P_k(\mathbf{x}|\omega_T) \geq \theta_k \right) + \prod_{k=1}^{R} (1 - f_{T,k})\, I\left( P_k(\mathbf{x}|\omega_T) < \theta_k \right)$.

Finally the mean of the estimated probabilities:

$$ y_{mp}(\mathbf{x}) = \frac{1}{R} \sum_{k=1}^{R} P_k(\mathbf{x}|\omega_T) \tag{9} $$

and the product combination of the probabilities:

$$ y_{pp}(\mathbf{x}) = \frac{\prod_{k} P_k(\mathbf{x}|\omega_T)}{\prod_{k} P_k(\mathbf{x}|\omega_T) + \prod_{k} P_k(\mathbf{x}|\omega_O)} \tag{10} $$

Here we will use the approximation that the outliers are uniformly distributed, so
that $P_k(\mathbf{x}|\omega_O)$ is a constant. All these combining rules will be
compared in a real world one-class problem in the next section.
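
Four of the five rules can be written down directly from equations (6), (7), (9)
and (10); the product of weighted votes (8) is left out of this sketch. The function
names, the example outputs and the constant outlier value 0.5 are assumptions chosen
only to make the example concrete.

```python
from math import prod

def mean_vote(P, theta):
    """Equation (6): fraction of classifiers that accept the object."""
    return sum(p >= t for p, t in zip(P, theta)) / len(P)

def mean_weighted_vote(P, theta, f_T):
    """Equation (7): each vote is replaced by f_T (accept) or 1 - f_T (reject)."""
    return sum(f if p >= t else 1 - f for p, t, f in zip(P, theta, f_T)) / len(P)

def mean_of_probabilities(P):
    """Equation (9): average of the estimated target probabilities."""
    return sum(P) / len(P)

def product_of_probabilities(P, P_outlier):
    """Equation (10): normalised product, with given outlier probabilities."""
    target = prod(P)
    outlier = prod(P_outlier)
    return target / (target + outlier)

# Hypothetical outputs of R = 3 one-class classifiers for one object.
P = [0.8, 0.3, 0.6]
theta = [0.5, 0.5, 0.5]
f_T = [0.9, 0.9, 0.9]
P_outlier = [0.5, 0.5, 0.5]  # assumption: a constant (uniform) outlier value

y_mv = mean_vote(P, theta)
y_mwv = mean_weighted_vote(P, theta, f_T)
y_mp = mean_of_probabilities(P)
y_pp = product_of_probabilities(P, P_outlier)
print(y_mv, y_mwv, y_mp, y_pp)
```

In this example two of the three classifiers accept the object, so both voting rules
exceed 0.5 and would accept it under a threshold of 0.5.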



Fig. 1. ROC curves of the five combining rules (panel title: Morph, class 3).
Individual classifiers are shown by stars.



2.4 Error 

For the evaluation of one-class classifiers and the combination rules, we consider
the Receiver-Operating Characteristic curve (ROC curve). It gives the target
acceptance and outlier rejection rates for varying threshold values. Note that
for estimating the outlier rejection rate, we need example outlier objects. An
example is shown in figure 1. Here the results for four individual classifiers
trained on one identical feature set and five combination rules are shown. Because
each classifier is trained for a 10% target rejection rate, the method is optimized
for just one point on the ROC curve (ideally on the vertical line with 10%
target rejection rate). These points are indicated by the thick dots. The
2-dimensional curves are the ROC curves of the combining rules. The product
combination rule performs best here, because for the same fraction of target objects
rejected, fewer outlier objects are accepted than by other methods.

To make comparisons between classifiers a 1-dimensional error is derived from
this curve. This is called the Area Under the Curve (AUC) [2], and it measures
the total error integrated over (in principle) all threshold values. Because we are
mainly interested in situations where we accept large fractions of the target set,
we use threshold values with a target rejection rate from 0.05 to 0.5. Although
each classifier is optimized to reject 10% of the target data, during the evaluation
of the combination rules this complete range over the ROC curve is considered.
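
One way to compute such a restricted error is sketched below. The exact
normalisation used for the tables is not stated in the text, so this is an
assumption: the error is taken as the trapezoidal integral of the fraction of
accepted outliers over target rejection rates from 0.05 to 0.5. Function names and
the toy scores are hypothetical.

```python
def outlier_acceptance(target_scores, outlier_scores, reject_rate):
    """Fraction of outliers accepted when the threshold rejects (about)
    `reject_rate` of the target objects; higher score = more target-like."""
    ordered = sorted(target_scores)
    n_reject = int(round(reject_rate * len(ordered)))
    # Threshold = lowest target score that is still accepted.
    theta = ordered[min(n_reject, len(ordered) - 1)]
    return sum(s >= theta for s in outlier_scores) / len(outlier_scores)

def roc_error(target_scores, outlier_scores, lo=0.05, hi=0.5, steps=45):
    """Trapezoidal integral of the outlier acceptance over [lo, hi]."""
    rates = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    values = [outlier_acceptance(target_scores, outlier_scores, r) for r in rates]
    width = (hi - lo) / steps
    return sum((a + b) / 2 * width for a, b in zip(values, values[1:]))

# Hypothetical scores: 100 targets, outliers either well separated or identical.
targets = [i / 100 for i in range(100)]
separated = [-1.0 - i for i in range(10)]
overlapping = list(targets)

error_separated = roc_error(targets, separated)
error_overlapping = roc_error(targets, overlapping)
print(error_separated, error_overlapping)
```

Well separated outliers give an error of zero, while outliers that are
indistinguishable from the targets give the largest error over this range.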

2.5 Difference between Mean and Product Rule

In the combination of normal classifiers it appears that often the more robust 
average combination rule is to be preferred. Here extreme posterior probability 
estimates are averaged out. In one-class classification only the target class is 
modeled and a low uniform distribution is assumed for the outlier class. This 
makes this classification problem asymmetric and extreme target class estimates 
are not cancelled by extreme outlier estimates. 




Fig. 2. (Left) Five target probability density estimates which should be combined. 
(Right) Combination of the five target probability density estimates 



In figure 2 (left) five one-class classifiers are shown for an artificial
1-dimensional problem with data normally distributed around the origin (with unit
variance). Due to some atypical training samples two of the classifiers are somewhat
remote from the other three. In figure 2 (right) the resulting estimates by the
product and mean combination rules are shown. The mean combination covers a broad
domain in feature space, while the product rule has a restricted range. Especially
in high dimensional spaces this extra area will cover a large volume and potentially
a large number of outliers.

This effect is observable in figure 1. For target rejection rates less than 20%
the product combination rule accepts fewer outlier objects than the mean combination
or the other combination rules. This indicates that the covered volume is
less than for the other combining rules.
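
The difference in covered range can be checked numerically. In this sketch the five
density estimates are unit-variance Gaussians with hypothetical centres (two of them
displaced), and the "covered range" is measured as the length of the interval where
the combined estimate exceeds 10% of its peak; both choices are assumptions made for
illustration.

```python
import math

def gauss(x, mu, sigma=1.0):
    """Normal density with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def support_width(values, xs, fraction=0.1):
    """Length of the x-range where the estimate exceeds `fraction` of its peak."""
    peak = max(values)
    inside = [x for x, v in zip(xs, values) if v >= fraction * peak]
    return max(inside) - min(inside)

means = [-0.2, 0.0, 0.2, -2.0, 2.0]  # hypothetical: two estimates are displaced
xs = [i / 100 for i in range(-600, 601)]

mean_comb = [sum(gauss(x, m) for m in means) / len(means) for x in xs]
prod_comb = [math.prod(gauss(x, m) for m in means) for x in xs]  # unnormalised

width_mean = support_width(mean_comb, xs)
width_prod = support_width(prod_comb, xs)
print(width_mean, width_prod)
```

The product of five unit-variance Gaussians is proportional to a Gaussian with
variance 1/5, so its range is much narrower than that of the mean combination, which
still covers the two displaced estimates.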

3 Experiments 

We will apply the combining rules to one-class classifiers trained on a handwritten
digits dataset. This dataset consists of six feature sets: profile features,
Fourier features, Karhunen-Loeve features, some morphological features, pixel
features and Zernike features extracted from the scanned handwritten digits.
For the one-class combining problem one class (digit class 3) of handwritten digits
is described by the data descriptions and distinguished from all other classes.
One hundred training objects are drawn from the target class (no negative examples
are used). For testing again 100 objects per class, now both target and outlier
classes, are used. This gives a total of 100 target and 900 outlier objects. All
feature sets are mapped by PCA to retain 90% of the variance in the data. After the
PCA all features are scaled to zero mean and unit variance.

Table 1. Results of all the individual classifiers on class 3. Results are multiplied
by 100 and averaged over 10 runs. The number between brackets indicates the standard
deviation of the outcome.

         profile       Fourier       KL            morph         pixel         Zernike
Gauss    2.98 (1.33)   3.37 (0.88)   1.34 (0.77)   10.76 (0.80)  0.45 (0.34)   6.23 (1.32)
MoG      3.46 (2.17)   3.34 (0.75)   1.66 (1.50)   11.17 (0.82)  0.43 (0.33)   6.60 (1.27)
Parzen   2.26 (0.89)   2.50 (0.73)   0.52 (0.35)   11.03 (0.84)  0.28 (0.27)   3.97 (1.52)

svdd     7.84 (2.51)   3.75 (3.63)   5.13 (2.49)   17.60 (3.48)  1.53 (1.12)   13.07 (3.59)
kmeans   4.17 (1.84)   2.64 (2.04)   1.07 (0.47)   12.56 (1.48)  0.49 (0.23)   8.40 (4.39)
kcenter  4.16 (1.23)   3.78 (3.57)   1.63 (0.77)   17.17 (3.79)  0.74 (0.29)   7.71 (1.65)
autoenc  8.67 (3.93)   3.52 (2.93)   1.93 (1.00)   13.21 (0.80)  0.89 (0.57)   9.99 (2.15)

All one-class classifiers contain some magic parameters. In the normal distribution
the covariance matrix is regularized by $\Sigma' = \Sigma + \lambda I$ to make
inversion of the matrix possible (where $\lambda$ is taken as small as possible to
make inversion possible, most often $\lambda = 1 \cdot 10^{\ldots}$). The numbers of
clusters in the mixture of Gaussians, the k-means and the k-center methods are 5, 10
and 10 respectively. The number of units in the bottleneck layer in the autoencoder
network is 5 and the SVDD is trained to reject 10% of the target data. Finally, the
width parameter in the Parzen density is optimized using maximum likelihood
optimization.
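
A common way to choose the Parzen width by maximum likelihood is to maximise the
leave-one-out likelihood of the training set. The sketch below shows this for
one-dimensional data with Gaussian kernels; whether the experiments used exactly
this leave-one-out variant is an assumption, and the data and the candidate widths
are hypothetical.

```python
import math

def loo_log_likelihood(data, h):
    """Leave-one-out log-likelihood of a 1-D Parzen estimate with Gaussian
    kernels of width h: each object is scored by the density of the others."""
    n = len(data)
    total = 0.0
    for i, x in enumerate(data):
        dens = sum(
            math.exp(-0.5 * ((x - y) / h) ** 2)
            for j, y in enumerate(data) if j != i
        ) / ((n - 1) * h * math.sqrt(2 * math.pi))
        total += math.log(max(dens, 1e-300))  # guard against log(0)
    return total

def best_width(data, candidates):
    """Candidate width with the highest leave-one-out log-likelihood."""
    return max(candidates, key=lambda h: loo_log_likelihood(data, h))

# Hypothetical training sample and grid of candidate widths.
data = [0.0, 0.3, 0.5, 0.9, 1.4, 1.6, 2.1, 2.2, 2.8, 3.5]
candidates = [0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0]
h_best = best_width(data, candidates)
print(h_best)
```

Leaving each object out of its own density estimate is what prevents the likelihood
from favouring an arbitrarily small width.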

In table 1 the AUC errors of the individual methods are shown for the different
feature sets. The first three methods are density estimators, the other four
are distance based methods. Different classifiers give different performances, and
in most cases the Parzen density estimator performs best. Only for the most
difficult dataset, the morphological dataset, does the normal distribution perform
better (on average). The best individual classifier is the Parzen density estimator,
while the easiest dataset to classify is the pixel dataset. Apparently the pixel
training set is a representative sample from the true distribution and the number
of training objects is sufficient to do a proper density estimation with a Parzen
density estimator. Finally note that in some cases the variance is very large!

In table 2 the AUC errors are shown for target class 3 when different classifiers
are combined on the same dataset. In the top part of the table the three
density methods are combined: the normal density, the mixture of Gaussians
and the Parzen density estimation. In these cases the outputs of the methods
do not require any mapping to probabilities. The results show that the product
combination rule is a very good combining rule. If the three density methods
estimated approximately the same probability, the mean combination would give a more
robust estimate. The fact that the density models vary much, combined with the effect
that the mean combination rule tends to increase the estimated target class volume
(see section 2.5), causes somewhat worse results than the product combination rule.

Table 2. Results of the combination of classifiers by the five combination rules on
class 3. Numbers in bold indicate an improvement over the best individual classifier.

Combining 3 density methods

       profile        Fourier        KL             morph          pixel          Zernike
mv     5.61 (1.29)    11.12 (11.93)  6.84 (13.44)   15.59 (0.99)   23.61 (22.56)  8.03 (1.22)
mwv    5.61 (1.29)    11.12 (11.93)  6.84 (13.44)   15.59 (0.99)   23.61 (22.56)  8.03 (1.22)
pwv    5.61 (1.29)    11.12 (11.93)  6.84 (13.44)   15.59 (0.99)   23.61 (22.56)  8.03 (1.22)
mp     3.00 (1.33)    3.37 (0.89)    1.35 (0.76)    10.85 (0.84)   0.45 (0.34)    6.23 (1.32)
pp     2.72 (1.25)    2.60 (0.62)    0.89 (0.54)    10.92 (0.81)   0.30 (0.30)    4.84 (1.61)

Combining distance methods

       profile        Fourier        KL             morph          pixel          Zernike
mv     4.23 (1.19)    3.78 (2.71)    1.53 (0.55)    13.51 (0.99)   6.16 (13.68)   7.03 (2.05)
mwv    4.33 (1.30)    3.78 (2.70)    1.52 (0.55)    13.45 (1.06)   6.18 (13.68)   7.16 (2.36)
pwv    4.14 (1.19)    3.81 (2.73)    1.48 (0.53)    13.54 (1.03)   6.15 (13.69)   6.93 (1.99)
mp     5.71 (1.55)    2.67 (2.13)    1.42 (1.21)    12.86 (1.63)   0.48 (0.27)    7.81 (2.92)
pp     3.63 (0.81)    2.62 (2.07)    1.14 (0.59)    11.96 (1.06)   0.48 (0.27)    6.71 (2.31)

Combining all methods

       profile        Fourier        KL             morph          pixel          Zernike
mv     3.42 (1.18)    5.83 (1.09)    1.19 (0.26)    12.33 (0.65)   1.48 (0.68)    5.96 (1.65)
mwv    3.42 (1.16)    5.84 (1.09)    1.31 (0.58)    12.30 (0.67)   1.47 (0.68)    5.96 (1.62)
pwv    3.44 (1.15)    5.83 (1.10)    1.22 (0.30)    12.34 (0.65)   1.48 (0.68)    6.15 (1.96)
mp     3.23 (0.99)    4.57 (1.83)    1.23 (0.78)    12.29 (1.74)   0.75 (0.56)    7.41 (2.80)
pp     2.55 (0.55)    3.35 (0.73)    0.86 (0.42)    12.12 (1.87)   0.64 (0.71)    4.79 (0.96)

In none of the cases do the combination rules achieve an improvement over the
best individual performance of the one-class classifiers, but in most cases the
product combination rule comes close. Only in one case does the mean combination
rule improve on the product combination rule. Furthermore, the first three
combination rules are often significantly worse than the last two, indicating that
approximating the probabilities by one value is insufficient. Differences between
these three rules are very small. They have an averaging behavior and often do
not approach the best individual performance.

In the middle part of the table the results for the combination of distance methods
are shown. Here a mapping to probabilities is performed (by equation (5)). Still,
most often no improvement over the best individual classifier can be observed; only
for the product combination rule can reliable improvements be observed. The
individual performances on the Zernike dataset are very poor, and almost all
combination rules (except for the mean combination rule) can improve these. The good
performance of the product combination rule is also somewhat surprising, because the
classifiers are trained on identical data, while the mapping from distance to
probability might introduce extra noise. Because of the large diversity of the
methods, however, the errors become uncorrelated and extreme estimates are
suppressed by the product combination rule.

Finally the last part of table 2 shows the results of combining both the
density and distance based methods. Here the best performance never beats the
best individual performance (most often the Parzen density estimator). Again it
can be observed that the product combination rule performs the best. In most
cases adding the distance methods improves the first three combining rules, but
deteriorates the last two.



Table 3. ROC errors obtained by combining the same classifiers trained on the six
different feature sets.

       Gauss         MoG           Parzen        SVDD          kmeans        kcenters      autoenc
mv     1.7 (2.5)     0.87 (1.3)    0.12 (0.07)   7.5 (2.0)     2.8 (3.7)     0.38 (0.33)   0.13 (0.05)
mwv    1.72 (2.5)    0.8 (1.5)     0.12 (0.07)   7.5 (2.0)     3.1 (3.6)     0.37 (0.33)   0.12 (0.05)
pwv    1.84 (2.5)    0.9 (1.2)     0.12 (0.07)   7.5 (2.0)     5.4 (4.3)     0.36 (0.32)   0.12 (0.05)
mp     1.37 (2.2)    12.0 (1.3)    11.38 (0.82)  2.06 (1.9)    7.2 (4.5)     2.30 (1.12)   0.43 (0.35)
pp     0.41 (0.7)    0.2 (0.1)     0.07 (0.05)   2.1 (1.8)     3.1 (4.0)     1.77 (1.45)   0.42 (0.34)

Finally in table 3 the results of combining classifiers on different feature
sets are shown. Clearly combining different feature sets is more effective than
combining different classifiers. Only in some cases is the performance worse than
that of the best individual classifier. For the density methods this happens for the
mean combination rule, while for the last three methods (kmeans, kcenters and the
autoencoder network) both the mean and product combination rules perform worse than
the first three rules. Here the results on the different feature sets vary very
much. It appears that the majority vote and the weighted versions are robust enough
to cope with that.

4 Conclusions 

In this paper we investigated the use of combining one-class classifiers. The best
individual one-class classifier in this problem appears to be the Parzen density
estimator on the pixel dataset. Improving the results of the Parzen estimator
appears to be hard, because the training sample in this dataset appears to be
a representative sample from the "true" distribution. As can be expected, combining
classifiers trained in different feature spaces is the most useful. Here the
different feature sets contain much independent information, which often results
in good classification results. In most situations the product combination rule
gives the best results. Approximating the probability by just two values often
harms the combination rules, so it is useful to use the complete density,
or distance to the model. The mean combination rule suffers from the fact that
the area covered by the target set tends to be overestimated, so that more outlier
objects are accepted than is necessary.



Abstract. Given an arbitrary data set, to which no particular parametrical,
statistical or geometrical structure can be assumed, different clustering
algorithms will in general produce different data partitions. In fact,
several partitions can also be obtained by using a single clustering algorithm,
due to dependencies on initialization or the selection of the value
of some design parameter. This paper addresses the problem of finding
consistent clusters in data partitions, proposing the analysis of the most
common associations performed in a majority voting scheme. Combination
of clustering results is performed by transforming data partitions
into a co-association sample matrix, which maps coherent associations.
This matrix is then used to extract the underlying consistent clusters.
The proposed methodology is evaluated in the context of k-means clustering,
a new clustering algorithm - voting-k-means - being presented.
Examples, using both simulated and real data, show how this majority
voting combination scheme simultaneously handles the problems of
selecting the number of clusters and dependency on initialization.
Furthermore, resulting clusters are not constrained to be hyper-spherically
shaped.



1 Introduction 

Clustering algorithms are valuable tools in exploratory data analysis, data min- 
ing and pattern recognition. They provide a means to explore and ascertain 
structure within the data, by organizing it into groups or clusters. Many
clustering algorithms exist in the literature, from model-based, non-parametric
density estimation based methods, central clustering [2] and square-error
clustering, graph theoretical based, to empirical and hybrid approaches. They
all rely on some underlying concept about data organization and cluster
characteristics. Each being a best fit to some criterion, no single algorithm
can adequately handle all sorts of cluster shapes and structures; when
considering hybrid structure data sets, different and possibly inconsistent
data partitions are produced by different clustering algorithms. In fact, many
partitions can also be obtained by using a single clustering algorithm. This
phenomenon arises, for instance, from dependency on initialization, as in the
k-means algorithm, or from a particular selection of some design parameter
(such as the number of clusters, or the value of some threshold responsible for
cluster separation). Model order selection is sometimes left as a design
parameter; in other instances, the selection of the optimal number of clusters
is incorporated in the clustering procedure, either using local or global
cluster validity criteria.

Theoretical and practical developments over the last decade have shown that 
combining classifiers is a valuable approach in supervised learning, in order to 
produce accurate recognition results. The idea of combining the decisions of clus- 
tering algorithms for obtaining better data partitions is thus worth investigating. 

In supervised learning, a diversity of techniques for combining classifiers has
been developed. Some make use of the same representation for patterns
while others explore different feature sets, resulting from different processing and
analysis or by simple split of the feature space for dimensionality reasons. A first
aspect in combining classifiers is the production of an ensemble of classifiers.
Methods for constructing ensembles include: manipulation of the training
samples, such as bootstrapping (bagging), reweighting the data (boosting) or
using random subspaces; manipulation of the labelling of data, an example of which
is error-correcting output coding; injection of randomness into the learning
algorithm, for instance random initialization of a neural network; and applying
different classification techniques on the same training
data set, for instance under a Bayesian framework. Another aspect concerns how
the outputs of the individual classifiers are to be combined. Once again, various
combination methods have been proposed, adopting parallel, sequential
or hybrid topologies. The simplest combination method is majority voting. The
theoretical foundations and behavior of this technique have been studied,
proving its validity and providing useful guidelines for designing classifiers;
furthermore, this basic combination rule requires no prior training, which makes
it well suited for extrapolation to unsupervised classification tasks.

In this paper we address the problem of finding consistent clusters within
a set of data partitions. The rationale of the approach is to weight associations
between sample pairs by the number of times they co-occur in a cluster from the
set of data partitions produced by independent runs of clustering algorithms,
and to propose this co-occurrence matrix as the support for consistent clusters
development using a minimum spanning tree like algorithm. The validity of this
majority voting scheme (section 2) is tested in the context of k-means based
clustering, a new algorithm being presented (section 4). Evaluation of results
on application examples (section 5) makes use of a consistency index between
a reference data partition (taken as ideal) and the partitions produced by the
methods; a procedure for determining matching clusters is hence described in
section 3.

2 Majority Voting Combination of Clustering Algorithms 

In exploratory data analysis, different clustering algorithms will in general pro- 
duce different results, no general optimal procedure being available. Given a 
data set, and without any a priori information, how can one decide which 
clustering algorithm will perform better? Instead of choosing one particular 
method/algorithm, in this paper we put forward the idea of combining their 
classification results: since each of them may have different strengths and weak-
nesses, it is expected that their joint contributions will have a compensatory 
effect. Having in mind a general framework, not conditioned by any particular 
clustering technique, a majority voting rule is adopted. 

The idea behind majority voting is that the judgment of a group is superior 
to those of individuals. This concept has been extensively explored in combin- 
ing classifiers in order to produce accurate recognition results. In this section 
we extend this concept to the combination of data partitions produced by en- 
sembles of clustering algorithms. The underlying assumption is that neighboring 
samples within a "natural" cluster are very likely to be co-located in the same
group by a clustering algorithm. By considering the partitions of the data pro- 
duced by different clusterings, pairs of samples are voted for association in each 
independent run. The results of the clustering methods are thus mapped into 
an intermediate space: a co-association matrix, where each (i,j) cell represents 
the number of times the given sample pair has co-occurred in a cluster. Each 
co-occurrence is therefore a vote towards their gathering in a cluster. Dividing
the values of this matrix by the number of clustering experiments gives a normalized
voting. The underlying data partition is devised by majority voting, comparing
normalized votes with the fixed threshold 0.5, and joining in the same cluster
all the data linked in this way. Table 1 outlines the proposed methodology.



Table 1. Devising consistent data partitions using a majority voting scheme. 



Input: N samples; a clustering ensemble E of dimension R.
Output: Data partitioning.

Initialization: Set the co-association matrix, co_assoc, to a null N x N matrix.

Steps:

1. Produce data partitions and update the co-association matrix:

   For r = 1 to R do

   1.1. Run the rth clustering method in the ensemble E and produce a data
        partition P^r.

   1.2. Update the co-association matrix accordingly:

        For each sample pair, (i,j), in the same cluster in P^r, set
        co_assoc(i,j) = co_assoc(i,j) + 1/R.

2. Obtain the consistent clusters by thresholding on co_assoc:

   2.1. Find majority voting associations:

        For each sample pair, (i,j), such that co_assoc(i,j) > 0.5, join the samples in
        the same cluster; if the samples were in distinct previously formed clusters,
        join the clusters.

   2.2. For each remaining sample not included in a cluster, form a single element
        cluster.

3. Return the clusters thus formed.
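
The steps of Table 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `consistent_clusters`, the representation of each partition as a list of cluster labels, and the use of a union-find structure for step 2.1 are all assumptions made here.

```python
from itertools import combinations


def consistent_clusters(partitions, n_samples, threshold=0.5):
    """Combine label vectors (one per clustering run) by majority voting."""
    runs = len(partitions)
    # votes[(i, j)] counts the runs in which samples i and j share a cluster.
    votes = {}
    for labels in partitions:
        for i, j in combinations(range(n_samples), 2):
            if labels[i] == labels[j]:
                votes[(i, j)] = votes.get((i, j), 0) + 1

    # Union-find: join every pair whose normalized vote exceeds the threshold.
    parent = list(range(n_samples))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for (i, j), count in votes.items():
        if count / runs > threshold:
            parent[find(i)] = find(j)

    # Samples never joined to anything remain single-element clusters.
    roots = {}
    return [roots.setdefault(find(s), len(roots)) for s in range(n_samples)]


runs = [[0, 0, 1, 1, 2], [0, 0, 0, 1, 1], [1, 1, 0, 0, 0]]
print(consistent_clusters(runs, 5))  # [0, 0, 1, 1, 1]
```

In this toy example samples 2 and 4 co-occur in only one of the three runs, yet they end up in the same cluster because each is linked to sample 3 by a majority of votes; joining is transitive, which is what allows non-spherical clusters to emerge.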



Without requiring prior training, this technique can easily cope with a diver- 
sity of scenarios: classifiers using the same representation for patterns or making 
use of different representations (such as different feature sets); combination of 
classifications produced by a single method or architecture with different
parameters or fusion of multiple types of classifiers.

Section 4 integrates this methodology into a k-means based clustering
technique. Evaluation of the results makes use of a partitions consistency index,
described next.



3 Matching Clusters in Distinct Partitions 



Let $P^1$, $P^2$ be two data partitions. In what follows, it is assumed that the
number of clusters in each partition is arbitrary and samples are enumerated and
referenced using the same labels in every partition, $s_i$, $i = 1, \ldots, n$. Each cluster
has an equivalent binary valued vector representation, each position indicating
the truth value of the proposition: sample $i$ belongs to the cluster. The following
notation is used:

$P^i$ = partition $i$: $(nc_i, C^i_1, \ldots, C^i_{nc_i})$

$nc_i$ = number of clusters in partition $i$

$C^i_j = \{s_l : s_l \in \text{cluster } j \text{ of partition } i\}$
       = list of samples in the $j$th cluster of partition $i$

$X^i_j(k) = 1$ if $s_k \in C^i_j$, and $0$ otherwise, $k = 1, \ldots, n$
       = binary valued vector representation of cluster $C^i_j$

We define pc_idx, the partitions consistency index, as the fraction of shared
samples in matching clusters in two data partitions, over the total number of
samples:

$$pc\_idx = \frac{1}{n} \sum_{i=1}^{\min\{nc_1, nc_2\}} n\_shared_i$$

where it is assumed that matching clusters occupy the same position in the ordered
clusters lists of the partitions, and n_shared_i is the number of samples shared by
the ith clusters.

The clusters matching algorithm is an iterative procedure that, in each step, 
determines the pair of clusters having the highest matching score, given by the 
fraction of shared samples. It can be described schematically: 

Input: Partitions $P^1$, $P^2$; $n$, the total number of samples.

Output: $P^{2'}$, partition $P^2$ reordered according to the matching clusters in $P^1$;
pc_idx, the partitions consistency index.

Steps:

1. Convert clusters $C^i_j$ into the binary valued vector description $X^i_j$:

   $C^i_j \rightarrow X^i_j$, $i = 1, 2$, $j = 1, \ldots, nc_i$.

2. Set: P2_new_indexes(i) = 0, $i = 1, \ldots, nc_2$ (clusters new indexes);
   n_shared = 0.

3. Do $\min\{nc_1, nc_2\}$ times:

   - Determine the best matching pair of clusters, $(k, l)$, between $P^1$ and $P^2$
     according to the match coefficient:

     $$(k, l) = \arg\max_{i,j} \left\{ \frac{X^{1\,T}_i X^2_j}{X^{1\,T}_i X^1_i + X^{2\,T}_j X^2_j - X^{1\,T}_i X^2_j} \right\}$$

   - n_shared = n_shared + $X^{1\,T}_k X^2_l$.

   - Rename $C^2_l$ as $C^2_k$: P2_new_indexes(l) = k.

   - Remove $C^1_k$ and $C^2_l$ from $P^1$ and $P^2$, respectively.

4. If $nc_1 > nc_2$ go to step 5; otherwise fill in empty locations in P2_new_indexes
   (clusters with no correspondence in $P^1$) with arbitrary labels in the set
   $\{nc_1 + 1, \ldots, nc_2\}$.

5. Reorder $P^2$ according to the new clusters labels in P2_new_indexes and put in
   $P^{2'}$; set pc_idx = n_shared / n.

6. Return $P^{2'}$ and pc_idx.
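
The matching procedure can be illustrated with a short Python sketch. It is a hedged illustration only: the name `match_partitions` is hypothetical, clusters are represented as Python sets rather than binary vectors (so the inner products become set intersection sizes), and cluster labels are assumed to be integers starting at 0.

```python
def match_partitions(labels1, labels2):
    """Greedy cluster matching; returns (relabelled partition 2, pc_idx)."""
    n = len(labels1)
    clusters1, clusters2 = {}, {}
    for s, (a, b) in enumerate(zip(labels1, labels2)):
        clusters1.setdefault(a, set()).add(s)
        clusters2.setdefault(b, set()).add(s)
    nc1 = len(clusters1)

    new_index = {}
    n_shared = 0
    for _ in range(min(len(clusters1), len(clusters2))):
        # Match coefficient: shared samples over the size of the union.
        k, l = max(
            ((a, b) for a in clusters1 for b in clusters2),
            key=lambda p: len(clusters1[p[0]] & clusters2[p[1]])
            / len(clusters1[p[0]] | clusters2[p[1]]),
        )
        n_shared += len(clusters1[k] & clusters2[l])
        new_index[l] = k
        del clusters1[k], clusters2[l]

    # Clusters of partition 2 left unmatched receive fresh labels.
    next_label = nc1
    for b in sorted(clusters2):
        new_index[b] = next_label
        next_label += 1

    return [new_index[b] for b in labels2], n_shared / n


reordered, pc_idx = match_partitions([0, 0, 0, 1, 1, 1], [1, 1, 0, 0, 0, 2])
print(reordered, round(pc_idx, 3))  # [0, 0, 1, 1, 1, 2] 0.667
```

Here four of the six samples fall in matching clusters, so the consistency index is 4/6.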



4 K-Means Based Clustering 

In this section we incorporate the previous methodology in the context of
k-means clustering. The resulting clustering algorithm is summarized in table 2
and will be hereafter referred to as voting-k-means. It basically proposes to
generate ensembles of clustering partitions by random initialization of the cluster
centers and random pattern presentation.



Table 2. Assessing the underlying number of clusters and structure based on a k-means 
voting scheme. 



Voting-K-Means algorithm.

Input: N samples; k - initial number of clusters (by default: k = √N);
       R - number of iterations.

Output: Data partitioning.

Initialization: Set the co-association matrix to a null N x N matrix.

Steps:

1. Do R times:

   1.1. Randomly select k cluster centers among the N data samples.

   1.2. Organize the N samples in random order, keeping track of the initial
        data indexes.

   1.3. Run the k-means algorithm with the reordered data and cluster centers
        and update the co-association matrix according to the partition thus
        obtained over the initial data indexes.

2. Detect the consistent clusters through the co-association matrix, using the
   technique defined previously.
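
A compact Python sketch of the procedure in Table 2 follows. It is an illustration under stated assumptions rather than the authors' code: the names `kmeans` and `voting_kmeans` are hypothetical, a plain batch (Lloyd) k-means is used, and because a batch update does not depend on the order of presentation the random reordering of step 1.2 is omitted.

```python
import math
import random


def kmeans(points, centers, max_iter=100):
    """Plain batch k-means (Lloyd); returns one cluster label per point."""
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels


def voting_kmeans(points, k=None, runs=10, seed=0):
    """Majority voting over `runs` randomly initialized k-means partitions."""
    rng = random.Random(seed)
    n = len(points)
    k = k or max(2, round(math.sqrt(n)))  # default k = sqrt(N), as in Table 2
    votes = [[0] * n for _ in range(n)]
    for _ in range(runs):
        centers = [list(p) for p in rng.sample(points, k)]
        labels = kmeans(points, centers)
        for i in range(n):
            for j in range(i + 1, n):
                if labels[i] == labels[j]:
                    votes[i][j] += 1

    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    # Join every pair of samples voted together in more than half of the runs.
    for i in range(n):
        for j in range(i + 1, n):
            if votes[i][j] / runs > 0.5:
                parent[find(i)] = find(j)

    roots = {}
    return [roots.setdefault(find(s), len(roots)) for s in range(n)]


points = [(float(v),) for v in (0, 1, 2, 3, 100, 101, 102, 103)]
print(voting_kmeans(points, k=2, runs=5))  # [0, 0, 0, 0, 1, 1, 1, 1]
```

The co-association matrix costs O(N^2) memory, which is acceptable for data sets of the size discussed here but would need a sparse representation for much larger N.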
4.1 Known Number of Clusters



One of the difficulties with the k-means algorithm is the dependency of the 
partitions produced on the initialization. This is illustrated in figure 1, which
represents two partitions produced by the k-means algorithm (corresponding 
to different cluster initializations) on a data set of 1000 samples drawn from 
a mixture of two Gaussian distributions with unit covariance and Mahalanobis 
distance between the means equal to 7. Inadequate data partitions, such as 
the one plotted in figure 1(a) can be obtained even when the correct number 
of clusters is known a priori. These misclassifications of patterns are however 
overcome by using a majority voting scheme, as outlined in table 2, setting k to
the known number of clusters: taking the votes produced by several runs of the
k-means algorithm, using randomized cluster center initializations and samples
reordering, leads to the correct data partitioning depicted in figure 1(b).

(a) k-means, k=2. (b) Voting-k-means

Fig. 1. Compensating the dependency of the k-means algorithm on cluster center ini- 
tialization, k=2. (a)- Data partition obtained with a single run of the k-means algo- 
rithm. (b)- Result obtained using another cluster centers initialization and also with 
the proposed method with 10 iterations. 



4.2 Unknown Number of Clusters 

Most of the time, the true number of clusters is not known in advance and
must be ascertained from the training data set. Based on the k-means
algorithm, several heuristic and optimization techniques have been proposed
to select the number of underlying classes. Also, it is well known that
the k-means algorithm, based on a minimum square error criterion, identifies
hyper-spherical clusters, spread around prototype vectors representing cluster
centers. Techniques for selecting the number of clusters according to this
optimality criterion basically identify an "optimal" number of cluster centers on the
data that splits it into the same number of hyper-spherical clusters. When the
data exhibits clusters with arbitrary shape, this type of decomposition is not
always satisfactory. In this section we propose to use a voting scheme associated
with the k-means algorithm to address both issues: selecting the number of
clusters and detecting arbitrarily shaped clusters.

The basic idea consists of the following: if a large number, k, of clusters is
selected, by randomly choosing the initial clusters centers and order of pattern
presentation, the k-means algorithm will split the training data into k subsets
which reflect high density regions; if k is large in comparison to the number of
true clusters, each intrinsic cluster will be split into arbitrary smaller clusters,
neighboring patterns having a high probability of being co-located in the same
cluster; by averaging over all associations of pattern pairs thus produced over R
runs of the k-means algorithm, it is expected to obtain high rates of votes on these
pairs of patterns, the true clusters structure being recovered by thresholding the
co-association matrix, as proposed before. The method therefore proposed is to
apply the algorithm described in table 2, setting k to a large value, say √N,
N being the number of patterns in the training set.
(a) K-means - iter. 1. (b) Voting-K-means - iteration 4. (c) Voting-K-means - iteration 10.

Fig. 2. Partitions produced by the k-means (k=14) and the voting-k-means algorithms.



The method is illustrated in figure 2, concerning the clustering of 200
2-dimensional patterns, randomly generated from a mixture of two Gaussian
distributions with unit covariance and Mahalanobis distance between the means
equal to 10. Figure 2(a) shows a data partition produced by the k-means algorithm (k=14);
distinct initializations produce different data partitioning. Accounting for
persistent pattern associations along the individual runs of the k-means algorithm,
the voting-k-means algorithm evolves to a stable partition of the data with two
clusters (see figures 2(b) and (c)).



5 Application Examples 

5.1 Simulated Data 

The proposed method is tested in the classification of data forming two well 
separated clusters shaped as half rings. The total number of samples is 400, 
distributed evenly between the two clusters; the voting-k-means is run setting k 
to 20. 






(a) K-means - k=2. (b) Voting-K-means. (c) Number of clusters. (d) Consistency index.



Fig. 3. Half ring data set. (a)-(b) Partitions produced by the k-means and the voting- 
k-means algorithms, (c)-(d) Convergence of the voting-k-means algorithm. 



Figure 3(a) plots a typical result with the standard k-means algorithm when
using k = 2, showing its inability to handle this type of clusters. By taking the
majority voting scheme, however, clusters are correctly identified (figure 3(b)).
The convergence of the algorithm to the correct data partitioning is depicted in
figures 3(c) and 3(d), according to which a stable solution is obtained after 25
iterations.

5.2 Iris Data Set

The Iris data set consists of three types of Iris plants (Setosa, Versicolor and
Virginica), with 50 instances per class, represented by 4 features.



Fig. 4. Iris data set: convergence of the voting-k-means algorithm; k = 8. 



As shown in figure 4, the proposed algorithm initially alternates between 2
and 3 clusters, with consistency indexes ranging between 0.67 (2 clusters - Setosa
vs Versicolor + Virginica) and 0.75 (3 clusters). It stabilizes at the two clusters
solution, which, although not corresponding to the known number of classes, con- 
stitutes a reasonable and intuitive solution as the Setosa class is well separated 
from the remaining classes, which are intermingled. 

6 Conclusions 

This paper proposed a general methodology for combining classification results 
produced by clustering algorithms. Taking an ensemble of clustering algorithms, 
their individual decisions/partitions are combined by a majority voting rule to 
derive a consistent data partition. 

We have shown how the integration of the proposed methodology in a k- 
means like algorithm, denoted voting-k-means, can simultaneously handle the 
problem of initialization dependency and selection of the number of clusters. 
Furthermore, as illustrated in examples, with this algorithm cluster shapes other 
than hyper-spherical can be identified. 

While explored in this paper under the framework of k-means clustering, the 
proposed technique does not entail any specificity towards a particular cluster- 
ing strategy. Ongoing work includes the adoption of the voting type clustering 
scheme with other clustering algorithms and the extrapolation of this method- 
ology to the combination of multiple classes of clustering algorithms. 
Abstract. In this paper the theory of unsupervised multi-layer stochastic
vector quantiser (SVQ) networks is reviewed, and then extended to
the supervised case where the network is to be used as a classifier. This
leads to a hybrid approach, in which training is governed both by
unsupervised and supervised pieces in the network objective function. The
unsupervised piece aims to preserve enough information in the network 
to be able to accurately reconstruct the input (i.e. the network serves as 
an encoder), whereas the supervised piece aims to reproduce the classi- 
fication output supplied by an external teacher (i.e. the network serves 
as a classifier). The tension between these two pieces of the objective 
function leads to an optimal network, in which typically the lower layers 
(near to the input) act as faithful encoders of the input, whereas the 
higher layers (near to the output) act as faithful classifiers. The results 
of some simulations are presented to illustrate these properties. 



1 Introduction 

For a review of the subject of combining classifiers see the introduction to P, 
where it is stated that the two main reasons for combining classifiers are effi- 
ciency and accuracy. Efficiency gains may be obtained when the classifier is im- 
plemented using a network of simple processing operations, and accuracy may be 
improved when results from two or more classifiers (with different strengths and 
weaknesses) are combined. Typically, both of these strategies are simultaneously
employed. Thus a classifier ensemble is implemented, where each classifier
has a different set of simple operations, so each has its own peculiar strengths 
and weaknesses. Then the classifiers in this ensemble are combined to produce 
the overall classifier. 

The question that will be addressed in this paper is how to simultaneously 
solve the two problems of designing the separate classifiers and combining them 
together to produce the overall classifier. A novel approach will be used, in which 
the overall classifier will be allowed to emerge by a process of self-organisation, 
that is driven by the minimisation of a suitably chosen objective function. This 
approach implements the overall classifier as a multi-layer network with full 

* (c) British Crown Copyright 2001 
interconnections between adjacent layers. However, self-organisation typically
discovers optimal solutions in which the connections implement a set of sim- 
ple processing operations using only a subset of the connections, followed by 
combination of their outputs to produce the overall classifier. 

The basic unit of computation in the self-organising network is a generalisation
of the standard vector quantiser (VQ), called a stochastic vector quantiser
(SVQ), in which samples are drawn probabilistically from a codebook.
The objective function used to optimise a SVQ is the mean Euclidean error that 
occurs when using the SVQ to encode an input as multiple probabilistic samples, 
followed by reconstruction from these samples to estimate the input. If multiple 
samples are allowed, then an SVQ can use much cleverer coding schemes than 
a standard VQ. For instance, self-organisation can cause the codebook to split 
into several smaller codebooks, each of which specialises in encoding only part 
of the input. This propensity for the codebook to split is the key to using self- 
organisation to form a classifier ensemble, in which the different classifiers have 
different strengths and weaknesses. 

Thus far the SVQ objective function is unsupervised, because it makes no 
provision for an external teacher to influence the way in which the SVQ encodes 
its input. This is easily rectified by adding a term to the SVQ objective function 
that attempts to steer the SVQ output towards some desired target output. This 
external supervision is readily put to use in designing a classifier network, where 
it may be used to force the final output of the network to be the required overall 
classifier. 

In Sect. 2 the underlying theory of SVQs is presented, and in Sect. 3 the
results of simulations are presented to demonstrate the potential use of SVQs for
classification.

2 Theory 

In this section various pieces of the previously published theory of stochastic
vector quantisers (SVQ) are unified to establish a coherent framework for
modelling SVQs. In Sect. 2.1 the basic theory of SVQs is given (which is equivalent
to the previously reported theory of FMCs), and in Sect. 2.2 it is extended to the case
of high-dimensional input data. In Sect. 2.3 the theory is further generalised
to chains of linked SVQs, and the use of an external teacher to supervise the
chain is explained in Sect. 2.4.



2.1 Stochastic Vector Quantisers 

The basic building block of the encoder/decoder model used in this paper is
the folded Markov chain (FMC), which is equivalent to the SVQ discussed
in Sect. 1. Thus an input vector $x$ is encoded as a code index vector $y$, which
is then subsequently decoded as a reconstruction $x'$ of the input vector. Both
the encoding and decoding operations are allowed to be probabilistic, in the
sense that $y$ is a sample drawn from $\Pr(y|x)$, and $x'$ is a sample drawn from
$\Pr(x'|y)$, where $\Pr(y|x)$ and $\Pr(x'|y)$ are Bayes' inverses of each other, as given
by $\Pr(x'|y) = \frac{\Pr(y|x')\Pr(x')}{\int dz\, \Pr(y|z)\Pr(z)}$, and $\Pr(x)$ is the prior probability from
which $x$ was sampled. Because the chain of dependences in passing from $x$ to
$y$ and then to $x'$ is first order Markov (i.e. it is described by the directed graph
$x \rightarrow y \rightarrow x'$), and because the two ends of this Markov chain (i.e. $x$ and $x'$) live
in the same vector space, it is called a folded Markov chain.

In order to ensure that the SVQ encodes the input vector optimally, a measure 
of the reconstruction error must be minimised. There are many possible ways 
to define this measure, but one that is consistent with many previous results, 
and which also leads to many new results, is the mean Euclidean reconstruction 
error measure $D$, which is defined as

$$D = \int dx\, \Pr(x) \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_n=1}^{M} \Pr(y|x) \int dx'\, \Pr(x'|y)\, \|x - x'\|^2 \qquad (1)$$

where $y = (y_1, y_2, \ldots, y_n)$, $1 \le y_i \le M$ is assumed, $\Pr(x)\Pr(y|x)\Pr(x'|y)$ is the
joint probability that the SVQ has state $(x, y, x')$, $\|x - x'\|^2$ is the Euclidean
reconstruction error, and $\int dx \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_n=1}^{M} \int dx' (\cdots)$ sums over all
possible states of the SVQ (weighted by the joint probability).

The Bayes' inverse probability $\Pr(x'|y)$ may be integrated out of this expression
for $D$ to yield

$$D = 2 \int dx\, \Pr(x) \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_n=1}^{M} \Pr(y|x)\, \|x - x'(y)\|^2 \qquad (2)$$

where the reconstruction vector $x'(y)$ is defined as $x'(y) = \int dx\, \Pr(x|y)\, x$.
Because of the quadratic form of the objective function, it turns out that
$x'(y)$ may be treated as a free parameter, whose optimum value (i.e. the solution
of $\frac{\partial D}{\partial x'(y)} = 0$) is $\int dx\, \Pr(x|y)\, x$, as required.
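
The step from Eq. 1 to Eq. 2 can be spelled out as follows. This is a sketch of the standard argument rather than a quotation of the original derivation; the shorthand $\sum_y$ for the $n$-fold sum and the marginal $\Pr(y) = \int dx\, \Pr(x)\Pr(y|x)$ are notational assumptions introduced here. The cross term in the first line vanishes because $x'(y)$ is the mean of $\Pr(x'|y)$, and the last step uses the Bayes' inverse relation $\Pr(x)\Pr(y|x) = \Pr(y)\Pr(x|y)$, which makes the two terms equal.

```latex
\begin{align*}
\int dx'\,\Pr(x'|y)\,\|x-x'\|^2
  &= \|x-x'(y)\|^2 + \int dx'\,\Pr(x'|y)\,\|x'-x'(y)\|^2, \\
D &= \int dx\,\Pr(x)\sum_{y}\Pr(y|x)\,\|x-x'(y)\|^2
   + \sum_{y}\Pr(y)\int dx'\,\Pr(x'|y)\,\|x'-x'(y)\|^2 \\
  &= 2\int dx\,\Pr(x)\sum_{y}\Pr(y|x)\,\|x-x'(y)\|^2 .
\end{align*}
```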



2.2 High Dimensional Input Spaces 

A problem with the standard VQ is that its code book grows exponentially 
in size as the dimensionality of the input vector is increased, assuming that 
the contribution to the reconstruction error from each input dimension is held 
constant. This means that such VQs are useless for encoding extremely high 
dimensional input vectors, such as images. The usual solution to this problem 
is to manually partition the input space into a number of lower dimensional 
subspaces, and then to encode each of these subspaces separately. However, it 
would be very useful if this partitioning could be done automatically, in such a 
way that typically the correlations within each subspace were much stronger than 
the correlations between subspaces, so that the subspaces were approximately 
statistically independent of each other. This is an example of the self-organised 
discovery of a classifier ensemble, in which each classifier focusses on only a
subspace of the input. 

The key step in solving this problem is to constrain the minimisation of D in 
such a way as to encourage the formation of code schemes in which each compo- 
nent of the code vector y codes a different subspace of the input vector x. There 
are two related constraints that may be imposed on $\Pr(y|x)$ and $x'(y)$ which
may be summarised as

$$\Pr(y|x) = \Pr(y_1|x)\Pr(y_2|x) \cdots \Pr(y_n|x), \qquad x'(y) = \frac{1}{n} \sum_{i=1}^{n} x'(y_i) \qquad (3)$$

Thus, for $i = 1, 2, \ldots, n$ and $1 \le y_i \le M$, each component $y_i$ is an independent
sample drawn from the codebook using $\Pr(y_i|x)$ (which is assumed to be the
same function for all $i$), and the reconstruction vector $x'(y)$ (vector argument)
is assumed to be a superposition of $n$ contributions $x'(y_i)$ (scalar argument).
Taken together, these constraints encourage the formation of coding schemes in 
which independent subspaces are separately coded, as required. 

The constraints in Eq. 3 prevent the full space of possible values of $\Pr(y|x)$ or
$x'(y)$ from being explored as $D$ is minimised, so they lead to an upper bound
$D_1 + D_2$ on the SVQ objective function $D$ (i.e. $D \le D_1 + D_2$), which may be
derived as (the details of this derivation, including the derivatives of $D_1 + D_2$,
have been reported previously)

$$D_1 = \frac{2}{n} \int dx\, \Pr(x) \sum_{y=1}^{M} \Pr(y|x)\, \|x - x'(y)\|^2$$

$$D_2 = \frac{2(n-1)}{n} \int dx\, \Pr(x)\, \Big\| x - \sum_{y=1}^{M} \Pr(y|x)\, x'(y) \Big\|^2 \qquad (4)$$

Note that $M$ (size of codebook) and $n$ (number of samples drawn from the
codebook using $\Pr(y|x)$) are effectively model order parameters, whose values need to
be chosen appropriately for each encoder optimisation problem. The properties
of the optimum encoder depend critically on the interplay between the statistical 
properties of the training data and the model order parameters M and n. 

In numerical simulations it is convenient to parameterise (i.e. constrain)
$\Pr(y|x)$ thus

$$\Pr(y|x) = \frac{Q(y|x)}{\sum_{y'=1}^{M} Q(y'|x)}, \qquad Q(y|x) \equiv \frac{1}{1 + \exp(-w(y) \cdot x - b(y))} \qquad (5)$$

where $Q(y|x)$ is a sigmoid function of $x$, with weight vector $w(y)$ and bias $b(y)$.
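
As a numerical illustration, the posterior of Eq. 5 and a sample estimate of the two terms of Eq. 4 can be computed as below. This is a minimal sketch under assumptions: the function names `posterior` and `upper_bound` are hypothetical, and the integral over $\Pr(x)$ is replaced by an average over a finite list of training vectors.

```python
import math


def posterior(x, weights, biases):
    """Pr(y|x) for y = 1..M: normalized sigmoids, as in Eq. 5."""
    q = [
        1.0 / (1.0 + math.exp(-sum(w_i * x_i for w_i, x_i in zip(w, x)) - b))
        for w, b in zip(weights, biases)
    ]
    total = sum(q)
    return [q_y / total for q_y in q]


def upper_bound(data, weights, biases, recon, n):
    """Sample estimate of (D1, D2) from Eq. 4, averaging over the data."""
    d1 = d2 = 0.0
    for x in data:
        p = posterior(x, weights, biases)
        # D1 term: expected squared error of a single-sample reconstruction.
        d1 += sum(p_y * math.dist(x, r) ** 2 for p_y, r in zip(p, recon))
        # D2 term: squared error of the posterior-averaged reconstruction.
        mean_recon = [
            sum(p_y * r[d] for p_y, r in zip(p, recon)) for d in range(len(x))
        ]
        d2 += math.dist(x, mean_recon) ** 2
    scale = 2.0 / (n * len(data))
    return scale * d1, scale * (n - 1) * d2


data = [(1.0, 0.0), (0.0, 1.0)]
weights = [(1.0, 0.0), (0.0, 1.0)]  # one weight vector w(y) per code index
biases = [0.0, 0.0]
recon = [(1.0, 0.0), (0.0, 1.0)]    # one reconstruction vector x'(y) per code index
print(upper_bound(data, weights, biases, recon, n=10))
```

For $n = 1$ the $D_2$ term vanishes and only the single-sample term remains; as $n$ grows, the weight shifts towards $D_2$, which rewards reconstructions built from the whole posterior vector.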



2.3 Chain of Linked Stochastic Vector Quantisers 

An SVQ may be generalised to a chain of linked SVQs, and further generalisation 
to any acyclically linked network of SVQs is also readily achieved. The vector 
of probabilities (for all values of the code index) computed by each stage in the
chain is used as the input vector to the next stage, and the overall objective 
function is a weighted sum of the SVQ objective functions derived from each 
stage. There are other ways of linking the stages together and defining an overall 
objective function, but the above prescription is the simplest possibility. The 
total number of free parameters in an L stage chain is 3L — 1, which is the sum of 
2 free parameters for each of the L stages, plus L — 1 weighting coefficients. There 
are L — 1 rather than L weighting coefficients because the overall normalisation 
of the objective function does not affect the optimum solution. 

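The chain of linked SVQs will now be expressed mathematically. Firstly, an
index $l$ (where $1 \le l \le L$) is introduced to allow different stages of the chain to
be distinguished thus

$$M \rightarrow M^{(l)}, \quad n \rightarrow n^{(l)}, \quad x \rightarrow x^{(l)}, \quad x' \rightarrow x'^{(l)} \qquad (6)$$

Then the stages are defined and linked together thus

$$x^{(l)} \rightarrow y^{(l)} \rightarrow x'^{(l)}, \qquad x^{(l+1)} = \left( x_1^{(l+1)}, x_2^{(l+1)}, \ldots, x_{M^{(l)}}^{(l+1)} \right), \qquad x_i^{(l+1)} = \Pr\left( y^{(l)} = i \,|\, x^{(l)} \right), \quad 1 \le i \le M^{(l)} \qquad (7)$$

Finally, the objective function and its upper bound are given by

$$D = \sum_{l=1}^{L} s^{(l)} D^{(l)} \le D_1 + D_2 = \sum_{l=1}^{L} s^{(l)} \left( D_1^{(l)} + D_2^{(l)} \right) \qquad (8)$$

where $s^{(l)} \ge 0$ is the weighting that is applied to the contribution of
stage $l$ of the chain to the overall objective function $D$.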
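
The linking rule, in which the vector of posterior probabilities of one stage becomes the input of the next, can be sketched in Python as follows. The sketch assumes the sigmoid parameterisation of Eq. 5 for every stage; the function names and the representation of a stage as a `(weights, biases)` pair are hypothetical choices made for illustration.

```python
import math


def stage_output(x, weights, biases):
    """Vector of Pr(y = i | x), i = 1..M, for one stage (normalized sigmoids)."""
    q = [
        1.0 / (1.0 + math.exp(-sum(w_i * x_i for w_i, x_i in zip(w, x)) - b))
        for w, b in zip(weights, biases)
    ]
    total = sum(q)
    return [q_i / total for q_i in q]


def chain_outputs(x, stages):
    """Feed x through a chain; `stages` is a list of (weights, biases) pairs."""
    outputs = []
    for weights, biases in stages:
        x = stage_output(x, weights, biases)  # input of the next stage
        outputs.append(x)
    return outputs


def chain_objective(stage_objectives, stage_weightings):
    """Weighted sum of per-stage objectives, in the spirit of Eq. 8."""
    return sum(s * d for s, d in zip(stage_weightings, stage_objectives))


stages = [
    ([(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)], [0.0, 0.0, 0.0]),  # 2 inputs, M = 3
    ([(1.0, -1.0, 0.0), (0.0, 1.0, -1.0)], [0.0, 0.0]),       # 3 inputs, M = 2
]
for out in chain_outputs((0.5, -0.5), stages):
    print([round(v, 3) for v in out])
```

Each stage needs as many weights per code index as the previous stage has code indices, which is why the second stage above uses three-component weight vectors.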



2.4 Supervision by an External Teacher 

The objective function $D$ in Eq. 8 can readily be extended to allow an
external teacher to supervise the chain of linked SVQs. Thus make the following
replacement in Eq. 8

$$D_1^{(l)} + D_2^{(l)} \rightarrow D_1^{(l)} + D_2^{(l)} + D_{\mathrm{supervise}}^{(l)} \qquad (9)$$

where $D_{\mathrm{supervise}}^{(l)}$ is any convenient objective function that the external teacher
wishes to apply to stage $l$ of the chain. For instance $D_{\mathrm{supervise}}^{(l)}$ might
measure the Euclidean error between the output of stage $l$ (whose components are
$\Pr(y^{(l)} = i \,|\, x^{(l)})$, $1 \le i \le M^{(l)}$) and some externally supplied reference vector.
Note that supervision can be applied to any or all of the stages, and is not
limited to only the final stage, as is conventional in supervised training. In the
context of combining classifiers, not only the design of the combined classifier,
but also the design of the individual classifiers may be supervised.
3 Simulations

In this section the results of several simulations are presented to illustrate the 
behaviour of SVQs in both unsupervised and supervised training scenarios.
Circular (i.e. $S^1$) and 2-toroidal (i.e. $S^1 \times S^1$) input manifolds are used.

3.1 Circular Input Manifold 

The simplest demonstration of a SVQ is to train it (unsupervised) with data 
that lie on a circular manifold. Fig. 1 shows contour plots of the posterior 
probabilities Pr(y|x) for y = 1, 2, 3, 4 in a SVQ with M = 4 and n = 10. The 
circular manifold is chopped up by the Pr(y|x) into four softly overlapping arcs. 

Fig. 1. Contours of Pr(y|x) (y = 1, 2, 3, 4) for data lying on a circle (represented by 
the white circle in each plot), trained using M = 4 and n = 10 

3.2 2-Toroidal Input Manifold 

A more sophisticated demonstration of a SVQ is to train it (unsupervised) with 
data that lives on a toroidal manifold. 

In Fig. 2(a) each of the Pr(y|x) for M = 8 and n = 5 has a localised response 
region on the 2-torus, and the 2-torus is thus chopped up into eight softly overlapping 
regions. This result (2-dimensional manifold) may be compared with 
the simpler result (1-dimensional manifold) in Fig. 1. In both cases, sampling 
a single code index from the code book is sufficient to determine the location 
of the input vector to an accuracy corresponding to a single localised response 
region. A major disadvantage of this type of encoder is that the size of the 
code book that is required to guarantee a given resolution (in each dimension of 
the input manifold) increases exponentially with input manifold dimensionality. 
This would be completely useless for very high dimensional applications, such 
as image processing. This general type of encoder will be called a joint encoder, 
because it simultaneously encodes all of the dimensions of the input manifold. 
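
The exponential growth mentioned above can be illustrated with a small calculation. This is only a sketch under a simplifying assumption that is not made in the text: the required resolution is expressed as a number of cells per manifold dimension, and a joint encoder needs one code index per cell. The function name `joint_code_book_size` is hypothetical.

```python
def joint_code_book_size(cells_per_dim: int, n_dims: int) -> int:
    """Code book size of a joint encoder under the one-code-per-cell assumption.

    Every combination of cells across the dimensions needs its own code index,
    so the size is cells_per_dim raised to the power n_dims.
    """
    return cells_per_dim ** n_dims


if __name__ == "__main__":
    # Four cells per dimension on manifolds of increasing dimensionality.
    for n_dims in (1, 2, 10):
        print(n_dims, joint_code_book_size(4, n_dims))
```

Under this assumption a resolution of four cells per dimension needs 4 codes on a circle, 16 on a 2-torus, and over a million on a 10-dimensional manifold.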

In Fig. 2(b) the results shown are analogous to those shown in Fig. 2(a), 
except that M = 8 and n = 50, so 10 times as many samples are now drawn 
from the code book. There is a marked difference in the shape of the response 
region of each of the Pr(y|x). The set of Pr(y|x) for y = 1, 2, ..., 8 has split 
into two subsets. In one subset the response regions are elongated vertically, 
and in the other subset they are elongated horizontally. In all cases there is 
approximate invariance of Pr(y|x) with respect to variations of x along the 
direction of elongation of the response region. In effect, one subset encodes one 
of the circular subspaces of the 2-torus $S^1 \times S^1$, and the other subset encodes 
the other circular subspace. In this type of encoder it is necessary to sample 
many times from the code book in order to guarantee that at least one sample 
is drawn from each of these subsets, so that both of the circular subspaces are 
represented in the code. In this case the location of the input vector may be 
determined to an accuracy corresponding to the region of intersection of an 
orthogonal pair of elongated response regions. 




Fig. 2. All of these plots use toroidal boundary conditions. (a) (left) Pr(y|x) for data 
lying on a 2-torus, trained using M = 8 and n = 5. (b) (middle) Pr(y|x) for data lying 
on a 2-torus, trained using M = 8 and n = 50. (c) (right) Pr(y|x) for data lying on a 
2-torus, trained using M = (8, 2) and n = (50, N/A), and using strong supervision 



A major advantage of this type of encoder is that the size of the code book 
that is required to guarantee a given resolution (in each dimension of the input 
manifold) increases linearly with input manifold dimensionality, and the price 
that has to be paid for this is the need to sample many times from the code 
book. This general type of encoder will be called a factorial encoder, because it 
separately encodes each of the dimensions of the input manifold. This propensity 
for the codebook to split into a number of smaller code books is the key to using 
self-organisation to form a classifier ensemble, in which the different classifiers 
have different strengths and weaknesses. 

3.3 2-Toroidal Input Manifold with Supervision 

A yet more sophisticated demonstration of an SVQ is to train it with 2-toroidal 
data (see Sect. 3.2) but this time introduce some supervision by an external 
teacher. 

To allow a reasonable amount of flexibility a 2-stage encoder will be used (see 
Sect. 2.3), where zero weight will be assigned to the $D_1^{(2)} + D_2^{(2)}$ contribution to 
the overall objective function D, so that the second stage is devoted entirely to 
dealing with the supervision that is introduced via the $D_{supervise}^{(2)}$ contribution. 
Back-propagation of derivatives of the objective function will ensure that this 
supervision also influences the first stage of the 2-stage encoder. 

The first and second stages will use code books with parameters (M = 8, n = 50) 
and (M = 2, n = N/A), respectively, and the external teacher will attempt to 
make the pair of outputs from the second stage equal to the pair of target outputs 
shown in Fig. 3(a). These target outputs form oppositely signed checkerboard 
patterns (of 0's and 1's) on the 2-toroidal input manifold, so that they have the 
properties of a posterior probability (i.e. non-negative and sum to unity at each 
point on the 2-torus). This particular form of target output has been chosen to 
be more difficult to produce using a factorial encoder (see Fig. 2(b)) than a joint 
encoder (see Fig. 2(a)), because the latter uses response regions that are similar 
in shape to the squares in the checkerboard pattern used by the external teacher 
(see Fig. 3(a)). 

Fig. 3. All of these plots use toroidal boundary conditions. (a) (left) Target outputs 
used for supervision of a 2-stage encoder. (b) (middle) Output produced when trained 
using weak supervision. (c) (right) Output produced when trained using strong 
supervision 

When the 2-stage encoder is trained using supervision that is weak enough 
not to significantly influence the first stage, the results are the same as those 
shown in Fig. 2(b). Fig. 3(b) then shows the output of the second stage, which 
should be compared with the required output shown in Fig. 3(a). The results 
in Fig. 3(b) are poor because the weak supervision signal cannot produce large 
enough back-propagated derivatives to override the unsupervised training of the 
first stage, which insists on producing a factorial code in which the two circular 
subspaces of the 2-toroidal input manifold are separately encoded (see Fig. 2(b)). 
In effect, these results are produced by combining the outputs of two badly 
designed classifiers, each of which concentrates on only one of the subspaces 
of the 2-torus. 

When the supervision is strong enough to significantly influence the first 
stage, the results are as shown in Fig. 2(c) and Fig. 3(c) (which are analogous to 
Fig. 2(b) and Fig. 3(b), respectively). Fig. 2(c) shows that the factorial encoder 
that arose when weak supervision was used (see Fig. 2(b)) has now been modified 
by the use of strong supervision (via back-propagation of the correspondingly 
large derivatives) to resemble a joint encoder (compare Fig. 2(a)). This is as 
expected, because the response regions of a joint encoder resemble the shape of 
the squares in the checkerboard pattern used by the supervisor (see Fig. 3(a)). 
Fig. 3(c) shows the output of the second stage, which now much more closely 
resembles the required output (see Fig. 3(a)) than when the supervision was 
weak (see Fig. 3(b)). In effect, these results are produced by a single classifier 
that makes optimal use of the $S^1 \times S^1$ 2-torus, rather than concentrating on 
only one of its subspaces at a time, as was the case in Fig. 3(b). 

These results obtained from a 2-stage network of linked SVQs are illustrative 
of the more general possibilities offered by acyclically linked networks of SVQs. 

4 Discussion 

A key behaviour of 2-stage encoders is exemplified by the results obtained from 
a 2-toroidal input manifold with supervision (see Sect. 3.3). 

When only weak supervision is used, the first stage optimises itself to encode 
the input so that it can reconstruct it with minimum distortion (i.e. minimise 
$D_1 + D_2$ in Eq. 8). There is no guarantee that this encoder will be any good for 
accurately producing the output required by the weak supervision. Effectively, 
the first stage preprocesses the input in a way that is heedless of the nature 
of the task that the second stage has to do. This is analogous to the situation 
where an overall classifier is constructed by combining the outputs of a set of 
individual classifiers that are designed in ignorance of the overall classification 
problem that is to be solved, so it does not perform very well. 

When strong supervision is used, the first stage is forced to heed the requirements 
of the second stage, and adjusts the way in which it preprocesses 
the input, so that the second stage is able to accurately produce the output 
required by the strong supervision. This is analogous to the situation where an 
overall classifier is constructed by combining the outputs of a set of individual 
classifiers that are designed in a way that is mindful of the overall classification 
problem that is to be solved, so it performs very well. 

There are many ways in which the results of Sect. 3.3 can be generalised. 
The multi-stage encoder could have more than two stages, and in general it 
could be an acyclically linked network of encoders having multiple input and 
multiple output stages. The basic properties of each encoder are specified by 
three parameters: size of the code book M, number of samples n drawn from 
the code book, and objective function weighting s. Supervision by an external 
teacher could be added to any or all of the encoder outputs. 

5 Conclusions 

In this paper SVQs have been shown to be a flexible tool for self-organised 
classifier fusion. The unsupervised part of the network objective function tries to 
preserve information about the input data as it is processed and passed through 
the network. The supervised part of the objective function tries to ensure that 
the required classifier output is produced at the network output, by judiciously 
discarding irrelevant information and massaging what is left into the required 
output form. The tension between these two pieces of the objective function 
leads to an optimal network, in which typically the lower layers (near to the 
input) act as faithful encoders of the input, whereas the higher layers (near to 
the output) act as faithful classifiers. 

Abstract. In this paper, the error-reject trade-off of linearly combined multiple 
classifiers is analysed in the framework of the minimum risk theory. Theoretical 
analysis described in [12,13] is extended for handling reject option and the 
optimality of the error-reject trade-off is analysed under the assumption of 
independence among the errors of the individual classifiers. Improvements of 
the error-reject trade-off obtained by linear classifier combination are 
quantified. Finally, a method for computing the coefficients of the linear 
combination and the value of the reject threshold is proposed. Experimental 
results on four different data sets are reported. 



1 Introduction 

It is well known that the reject option is useful for improving the classification reliability 
in pattern recognition applications for which the cost of rejecting certain patterns, and 
handling them with different procedures (e.g., manual classification), is lower than the 
cost of wrong classifications. In the framework of the minimum risk theory, Chow 
defined the optimal classification rule with reject option [2]. Let $w_C$ and $w_E$ be the 
costs for the correct and for the wrong classification of pattern x, and $w_R$ be the cost 
for reject (obviously, usually, $w_C = 0$). The Bayes expected risk is then: 

$w_C P(correct) + w_R P(reject) + w_E P(error)$. (1) 

According to Chow's rule, the above expected risk is minimised by accepting a 
pattern x and assigning it to the class $\omega_i$ if: 

$\max_k P(\omega_k \mid x) = P(\omega_i \mid x) \ge T = (w_E - w_R)/(w_E - w_C)$, (2) 

where $P(\omega_i \mid x)$ is the i-th class posterior probability of x. Otherwise, x is rejected. 
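
Chow's rule can be sketched in a few lines. The following is an illustrative sketch only, not code from the paper: it assumes the posterior estimates are given as rows of a NumPy array, and the names `reject_threshold`, `chow_decide` and the marker `REJECT` are hypothetical.

```python
import numpy as np

REJECT = -1  # hypothetical marker for a rejected pattern (not from the paper)


def reject_threshold(w_c, w_r, w_e):
    """Threshold T = (w_E - w_R) / (w_E - w_C) of Chow's rule, Eq. (2)."""
    return (w_e - w_r) / (w_e - w_c)


def chow_decide(posteriors, threshold):
    """Accept a pattern and assign it to the class with the largest posterior
    when that posterior reaches the threshold; otherwise reject it."""
    posteriors = np.asarray(posteriors, dtype=float)
    best_class = posteriors.argmax(axis=1)
    best_value = posteriors.max(axis=1)
    return np.where(best_value >= threshold, best_class, REJECT)


# Costs chosen only for illustration: correct = 0, reject = 0.2, error = 1.
T = reject_threshold(w_c=0.0, w_r=0.2, w_e=1.0)
decisions = chow_decide([[0.9, 0.1], [0.55, 0.45], [0.15, 0.85]], T)
```

With these illustrative costs the first and third patterns are accepted, while the second, whose largest posterior is only 0.55, is rejected.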

Even in multiple classifier systems (MCSs) there are cases in which the 
classification of a pattern is unreliable. For instance, the majority of classifiers 
can disagree about the classification of an input pattern; or, when the Bayesian 
average combination rule is used, several data classes can exhibit comparable values of 
the posterior probabilities. Therefore, the reject option is also useful for improving 
the classification reliability of MCSs. It is worth noting that Chow's rule can be used only 
for combination rules which provide estimates of the class posterior probabilities, like 
the Bayesian average combination rule [15,8] or the linear combination of neural 
networks [10]. For other kinds of combining rules, the classification reliability must 
be evaluated using the specific kind of information provided by the combiner. As an 
example, for the majority rule and its variations (e.g., weighted voting), the reject 
option is based on the disagreement among the decisions of the individual classifiers 
[1]. Therefore, the error-reject trade-off of MCSs strongly depends on the particular 
combination rule used. 

Besides many experimental works that reported the benefits of combining 
classifiers, some theoretical works investigated the hypotheses under which 
combination can improve the performances of the individual classifiers, and some 
papers quantitatively evaluated such improvements. For instance, Lam and Suen 
provided theoretical results allowing one to understand and quantify the performances 
of the majority rule [9]. Turner and Ghosh analysed the performances of linearly 
combining multiple classifiers in the framework of the Bayes decision theory [12,13]. 
However, to the best of our knowledge, no theoretical work addressed the problem of 
the error-reject trade-off for MCSs. Some papers have shown by experiments that 
classifier combination can improve the error-reject trade-off of individual classifiers 
[6,10,8]. However, such papers did not analyse the hypotheses under which this can 
happen and they did not quantify the improvements. 

In this paper the error-reject trade-off of linearly combined multiple classifiers is 
analysed in the framework of the minimum risk theory. Sect. 2 basically extends the 
work of Turner and Ghosh, that was confined to the case without reject option 
[12,13]. In Sect. 3, a method for computing the coefficients (“weights”) of the linear 
combination and the value of the reject threshold is proposed. Experimental results 
are reported in Sect. 4. 



2 Error-Reject Trade-Off for a Linear Combination of Classifiers 

The optimal error-reject trade-off is achieved by Chow's rule only if posterior 
probabilities are exactly known. Unfortunately, this does not happen in practical 
applications [5]. Therefore, in the following, we will assume that classifiers provide 
estimates of posterior probabilities, and compare the error-reject trade-off achievable 
by a linear combination of multiple classifiers with the optimal trade-off that could be 
obtained if posterior probabilities were exactly known. More precisely, as Chow's rule 
provides the minimum error probability P(error) for any value of the reject 
probability P(reject) [3], we compare, for a given value of P(reject), the values of 
P(error) achieved by a single classifier (Sect. 2.1) and by an MCS (Sect. 2.2) with 
Chow's optimal value of P(error). The theoretical contribution of this paper is 
contained in Sect. 2.1, where we show the dependence between P(error) and the 
estimation errors on posterior probabilities. It is worth noting that this dependence is 
similar to that found by Turner and Ghosh [12,13] for the case without reject option. 
This allows us to extend their results to MCSs with reject option (Sect. 2.2). 



2.1 Theoretical Framework 

Let us indicate the estimated posterior probability for the i-th class as: 

$\hat{p}_i(x) = p_i(x) + \varepsilon_i(x)$, (3) 

where $p_i(x)$ is the "true" posterior probability and $\varepsilon_i(x)$ is the estimation error. In the 
following, we will consider a simple one-dimensional classification task with two 
data classes $\omega_1$ and $\omega_2$, characterised by Gaussian distributions. At the end of this 
section, we will point out that our analysis holds for a general classification task. 

Fig. 1 shows the true and the estimated posterior probabilities of classes $\omega_1$ and $\omega_2$. 




Fig. 1. The true and the estimated posterior probabilities of classes $\omega_1$ and $\omega_2$ are shown 

Let us consider values of the reject thresholds T and T' such that the optimal reject 
region $[x_1, x_2]$ and the estimated one $[x_1 + b_1, x_2 + b_2]$ provide the same P(reject). We 
assume that the estimation errors are reasonably small, such that the offsets $b_1$ and $b_2$ 
between the estimated and the optimal decision and reject regions are reasonably 
small [5]. Due to the estimation errors, patterns belonging to $[x_1, x_1 + b_1]$ are accepted 
and assigned to class $\omega_1$ instead of being rejected. Analogously, patterns belonging to 
$[x_2, x_2 + b_2]$ are rejected instead of being accepted and assigned to $\omega_2$. 

Our goal is to compare, for a given P(reject), the P(error) achieved using the 
estimated and the true posterior probabilities. Following the work of Turner and 
Ghosh [12,13], we will first express the offsets $b_1$ and $b_2$ as a function of the 
estimation errors $\varepsilon_1(x)$ and $\varepsilon_2(x)$. Then, we will express and analyse the difference 
between the error probabilities (shaded areas in Fig. 1) as a function of $b_1$ and $b_2$. 

The estimated posterior probabilities of classes $\omega_1$ and $\omega_2$ take on the same value 
(T) at the boundaries of the reject region: $\hat{p}_1(x_1 + b_1) = \hat{p}_2(x_2 + b_2)$. From Eq. (3): 

$p_1(x_1 + b_1) + \varepsilon_1(x_1 + b_1) = p_2(x_2 + b_2) + \varepsilon_2(x_2 + b_2)$. 

Linearly approximating $p_1(x)$ and $p_2(x)$, respectively around $x_1$ and $x_2$, and noting 
that $p_1(x_1) = p_2(x_2)$ (these values are equal to the reject threshold T), we can write: 

$b_1 p_1'(x_1) + \varepsilon_1(x_1 + b_1) = b_2 p_2'(x_2) + \varepsilon_2(x_2 + b_2)$. (4) 

We can now express $b_2$ as a function of $b_1$ by exploiting the equality between the 
reject probabilities of the true and the estimated error regions (see Fig. 1): 

$\int_{x_1}^{x_1 + b_1} p(x)\,dx = \int_{x_2}^{x_2 + b_2} p(x)\,dx$, 

where p(x) is the probability density of x. If we approximate the values of 
p(x) in the domains of integration $[x_1, x_1 + b_1]$ and $[x_2, x_2 + b_2]$, respectively, with the 
constant terms $p(x_1)$ and $p(x_2)$, we obtain $p(x_1) b_1 = p(x_2) b_2$. Accordingly: 

$b_2 = [p(x_1)/p(x_2)]\, b_1$. (5) 

By substituting Eq. (5) in Eq. (4), we obtain $b_1$ as a function of $\varepsilon_1$ and $\varepsilon_2$: 

$b_1 = \dfrac{\varepsilon_2(x_2 + b_2) - \varepsilon_1(x_1 + b_1)}{s}$, (6) 

where s is a constant term. The above expression is similar to the one obtained by 
Turner and Ghosh for the case without reject option [12,13]. 

Now let us consider the difference $\Delta E$ between the error probabilities achieved 
using the estimated and the true posterior probabilities: 

$\Delta E = \int_{x_1}^{x_1 + b_1} p_2(x) p(x)\,dx - \int_{x_2}^{x_2 + b_2} p_1(x) p(x)\,dx$. 

As Chow's rule applied to the estimated posterior probabilities is not optimal, it 
follows that $\Delta E \ge 0$. Linearly approximating $p_2(x)$ and $p_1(x)$: 

$p_2(x_1 + b_1) = p_2(x_1) + b_1 p_2'(x_1), \quad p_1(x_2 + b_2) = p_1(x_2) + b_2 p_1'(x_2)$, 

and using the above constant approximation for p(x), the two above integrals coincide 
with the areas of two trapeziums, as shown in Fig. 2. 




Finally, using Eq. (5), it is easy to see that $\Delta E$ can be expressed as follows: 

$\Delta E(b_1) = [p_2(x_1) p(x_1) - p(x_1) p_1(x_2)]\, b_1 + \frac{1}{2}\left[p_2'(x_1) p(x_1) - \dfrac{p^2(x_1)}{p(x_2)} p_1'(x_2)\right] b_1^2 = c b_1 + d b_1^2$, (7) 

where c and d represent constant terms. It is worth noting that the expression of $\Delta E$ 
for the case without reject option contains only a second degree term [12,13]. 

Eqs. (6) and (7) show how the added error probability $\Delta E$, corresponding to a given 
value of P(reject), depends on the offset $b_1$, and therefore on the estimation errors 
$\varepsilon_1(x)$ and $\varepsilon_2(x)$. We pointed out that these expressions are similar to the ones obtained 
by Turner and Ghosh [12,13] for the case without reject option. It is then possible to 
extend their analysis to the case with reject option. More precisely, the expected value 
$E_{add}$ related to $\Delta E$ can be computed as a function of the parameters (mean and 
variance) of the probability density functions of $\varepsilon_1(x)$ and $\varepsilon_2(x)$. It is then possible to 
compare the values of $E_{add}$ corresponding to a single classifier and to an MCS. The 
details of the computations can be found in [4]. The results are summarised in Sect. 
2.2. 

We conclude this section by discussing briefly the main assumptions we made 
above. For a problem with more than two classes and more than one decision 
boundary, the reject region can be made up of several disjoint intervals corresponding 
to points in which the posterior probabilities of the locally dominant classes are lower 
than the reject threshold. However, it is easy to see that the above analysis would lead 
to expressions of the offset b and of the error probability $\Delta E$ similar to Eqs. (6) and 
(7). A linear approximation of the distribution p(x), instead of the constant 
approximation above, would lead to a non-linear dependency of $b_1$ on the estimation 
errors. But this complicates only the mathematical derivation of the probability 
distribution of $b_1$. 



2.2 Error-Reject Trade-Off for Linear Combination of Unbiased and Biased 
Classifiers 



In the following, we hypothesise that the estimation errors $\varepsilon_1(x)$ and $\varepsilon_2(x)$ are 
independent variables with means $\beta_1$ and $\beta_2$ and variances $\sigma^2_{\varepsilon_1}$ and $\sigma^2_{\varepsilon_2}$. 

We consider first the unbiased case ($\beta_1 = \beta_2 = 0$). From Eqs. (6) and (7) it turns out 
that the expected value of the added error probability $\Delta E(b_1)$ is $E_{add} = d\sigma^2_{b_1}$, where 
$\sigma^2_{b_1}$ is the variance of $b_1$. Let us now consider a constrained linear combination of the 
outputs of N classifiers. The estimated posterior probabilities, denoted with $\hat{p}_i^{ave}(x)$, can 
be expressed as follows: 



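$\hat{p}_i^{ave}(x) = \sum_{n=1}^{N} \alpha_n \hat{p}_i^n(x) = p_i(x) + \bar{\varepsilon}_i(x)$, (8) 

where 

$\alpha_n \ge 0, \quad \sum_{n=1}^{N} \alpha_n = 1$, (9) 

and the estimation error is $\bar{\varepsilon}_i(x) = \sum_{n=1}^{N} \alpha_n \varepsilon_i^n(x)$. In this case, denoting with $b^{ave}$ the offset, 
the expected value $E_{add}^{ave}$ of the added error probability is $E_{add}^{ave} = d\sigma^2_{b^{ave}} = \sum_{n=1}^{N} \alpha_n^2 E_{add}^n$. It turns 
out that the values of the coefficients which minimise $E_{add}^{ave}$, taking into account the 
constraints of Eq. (9), are: 

$\alpha_n = \left(\sum_{m=1}^{N} \dfrac{1}{E_{add}^m}\right)^{-1} \dfrac{1}{E_{add}^n}$. (10) 

Therefore we obtain: 

$E_{add}^{ave} \le \dfrac{1}{N} \max_{n=1,\ldots,N} E_{add}^n$. 

This shows that the linear combination reduces $E_{add}$, with respect to the worst 
individual classifier, up to a factor N. If the errors $\varepsilon_i^n(x)$ of the different classifiers 
exhibit the same variance ($\sigma^2_{\varepsilon_i^m} = \sigma^2_{\varepsilon_i^n}$ for all $m, n$), we obtain $\alpha_n = 1/N$ (simple average). In 
this case the linear combination reduces the added error by a factor N: 

$E_{add}^{ave} = d\sigma^2_{b^{ave}} = d\sigma^2_b / N = E_{add}/N$. 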
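
The effect of the weights on the expected added error in the unbiased case can be checked numerically. The sketch below is illustrative only: it takes the combined added error to be the sum of the squared weights times the individual added errors (independent, unbiased errors), and takes the minimising weights to be inversely proportional to each classifier's added error, which is the standard Lagrange-multiplier solution under the sum-to-one constraint. The function names and the numerical values are hypothetical.

```python
import numpy as np


def optimal_weights(e_add):
    """Weights proportional to 1 / E_add^n, normalised so that they sum to one."""
    inverse = 1.0 / np.asarray(e_add, dtype=float)
    return inverse / inverse.sum()


def combined_added_error(alphas, e_add):
    """Expected added error of the combination for independent, unbiased errors:
    the sum over classifiers of alpha_n squared times E_add^n."""
    alphas = np.asarray(alphas, dtype=float)
    e_add = np.asarray(e_add, dtype=float)
    return float(np.sum(alphas ** 2 * e_add))


# Illustrative added errors of three individual classifiers (made-up values).
e_add = np.array([0.04, 0.02, 0.01])
alphas = optimal_weights(e_add)                # [1/7, 2/7, 4/7]
e_ave = combined_added_error(alphas, e_add)    # 1/175, below max(e_add) / 3

# Equal added errors: the optimal weights reduce to the simple average.
equal = np.array([0.03, 0.03, 0.03])
e_equal = combined_added_error(optimal_weights(equal), equal)  # 0.03 / 3
```

In the first case the combined added error is below one third of the worst individual value; in the second it equals the common individual value divided by N = 3.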

Let us consider now the biased case ($\beta_i \ne 0$). For a single classifier we obtain: 

$E_{add} = c\beta + d(\sigma^2_b + \beta^2)$, 

where $\beta$ is the mean of the offset $b_1$. It is worth noting that this expression differs from 
the case without reject option for the term proportional to $\beta$ [12]. Linearly combining 
N classifiers, the expression of $E_{add}^{ave}$ is: 

$E_{add}^{ave} = c\beta^{ave} + d\left(\sigma^2_{b^{ave}} + (\beta^{ave})^2\right) = c\left(\sum_{n=1}^{N} \alpha_n \beta^n\right) + d\left[\sum_{n=1}^{N} \alpha_n^2 \sigma^2_{b^n} + \left(\sum_{n=1}^{N} \alpha_n \beta^n\right)^2\right]$, (11) 

where $\beta^{ave} = \sum_{n=1}^{N} \alpha_n \beta^n$ is the mean of the offset $b^{ave}$. In this case the values of the 
coefficients $\alpha_n$ which minimise $E_{add}^{ave}$ depend on the trade-off between the minimisation 
of the variance and of the mean of $b^{ave}$. Therefore the simple average may not be the 
optimal choice even if the errors $\varepsilon_i^n(x)$ of the different classifiers exhibit the same 
variance. In this case the simple average reduces by a factor N the variance of $b^{ave}$. Let 
us assume that its mean is reduced by a factor $z \ge 1$ ($\beta^{ave} = \beta/z$). Then, if $z \le \sqrt{N}$ it turns 
out that $E_{add}^{ave} \le E_{add}/z$, while if $z \ge \sqrt{N}$, then $E_{add}^{ave} \le E_{add}/\sqrt{N}$. The improvement of the 
error-reject trade-off, due to averaging biased classifiers, is then limited by $\min(z, \sqrt{N})$. This 
is a lower improvement than the one achievable without reject option, which is 
limited by $\min(z^2, N)$ [13]. 

Let us now recall that $E_{add}^{ave}$ represents, for any given value of P(reject), the 
difference between the P(error) of the linear classifier combination and the minimum 
error probability that Chow's rule could provide if posterior probabilities were exactly 
known (see Sect. 2.1). Accordingly, we can say that, for any given value of P(reject), 
the value of P(error) achieved by a linear combination of classifiers approaches 
Chow's minimum value of P(error) more closely as the number of classifiers increases. 



3 An Algorithm for Computing the Coefficients of the Linear 
Combination 

We showed that, by linearly combining N classifiers with independent estimation errors, 
the difference with respect to the optimal error-reject trade-off can decrease by up to a 
factor N. However, the theoretical results provided in Sect. 2 cannot be used to determine 
the values of the coefficients ("weights") of the linear combination, since, in real 
applications, the probability distribution of the estimation errors is unknown. In 
particular, Sect. 2 does not provide a method for computing the coefficients in Eq. 
(10) and the ones that minimise the expected value of the added error probability (Eq. 
11). Without reject option, the commonly used approach to evaluate the coefficients 
relies on the minimisation of the error rate of the MCS on a validation set [10,7,14]. 
The rationale behind such an approach is that, in principle, values of the coefficients 
such that the corresponding error probability is not higher than the one of the best 
single classifier always exist. As an example, if we could know that the best classifier 
on the test set is the i-th one, then a trivial choice of the coefficients would be $\alpha_i = 1$ 
and $\alpha_j = 0$, $j \ne i$. When using the reject option, one can find first the coefficients which 
minimise the error probability without reject, then, using these coefficients, find the 
value of the reject threshold T that minimises the expected risk. This approach was 
used in [10]. However, we point out that this approach is not guaranteed to 
minimise the expected risk effectively, since a separate minimisation over the 
coefficients $\alpha_n$ and the reject threshold T is performed. Accordingly, both the 
coefficients and the reject threshold should be determined by minimising the expected 
risk for given values of the costs $w_C$, $w_R$ and $w_E$. Let us therefore consider the problem 
of minimising the expected risk as a function of the coefficients and of the reject 
threshold, for given values of $w_C$, $w_R$ and $w_E$. Since P(correct) = 1 - P(error) - 
P(reject), the expected risk (Eq. 1) can be rewritten as follows: 

$risk = w_C + (w_R - w_C) P(reject) + (w_E - w_C) P(error)$. (12) 

Accordingly, the minimisation problem is: 

$\min_{\alpha_1, \ldots, \alpha_N, T} risk(\alpha_1, \ldots, \alpha_N, T)$. (13) 

Since in real applications P(error) and P(reject) can only be estimated from a 
finite data set, the corresponding estimate of the expected risk in Eq. (13), called 
"empirical" risk, is a discrete-valued function and cannot directly be minimised using 
techniques based on gradient descent. It is also difficult to approximate the empirical 
risk by a smooth function, as proposed in [14] for the error rate of a classifier (that 
corresponds to the empirical risk without reject option). We have therefore developed 
a special purpose algorithm for solving the above minimisation problem. This 
algorithm was derived from another one proposed by the authors in [5]. 

Our algorithm iteratively searches for a local minimum of the target function. 
Starting from a point consisting of given values of the N + 1 variables, at each step a 
neighbourhood of the current point is explored. Such a neighbourhood consists of 
points obtained from the current one by incrementing each variable, one at a time. 
The amplitude and the number of the increments are predefined. If a point exhibiting 
a value of the target function lower than the current one is found, then it becomes the 
new current point; otherwise the algorithm stops, and returns the current point as 
the solution. The solution, for given values of $w_C$, $w_R$ and $w_E$, corresponds to a point in the 
Error-Rejection (E-R) plane. The E-R curve of the MCS is obtained by varying the 
value of the costs. We point out that, since minimising function (12) is equivalent to 
minimising P(reject) + W P(error), where $W = (w_E - w_C)/(w_R - w_C) \in [1, +\infty)$, it is more 
convenient to minimise this last expression, since it depends on only one parameter. 
Since our algorithm can lead to a local minimum of the target function, the so called 
"multi-start technique" was used. For the same given values of the costs, the 
algorithm was run for a predefined number of times, starting from random values of 
the N + 1 variables. 
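
The search described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: increments are taken to be both positive and negative multiples of a fixed step, the weights are clipped and renormalised after each move so that the constraints of Eq. (9) hold, the threshold is clipped to [0, 1], and the best improving neighbour is chosen at each step. All function names, the array layout (classifiers, patterns, classes) and the default step sizes are hypothetical.

```python
import numpy as np


def empirical_cost(params, outputs, labels, W):
    """Empirical P(reject) + W * P(error); params = (alpha_1..alpha_N, T)."""
    alphas, T = params[:-1], params[-1]
    combined = np.tensordot(alphas, outputs, axes=1)   # (patterns, classes)
    accepted = combined.max(axis=1) >= T
    wrong = accepted & (combined.argmax(axis=1) != labels)
    return (1.0 - accepted.mean()) + W * wrong.mean()


def neighbours(params, step, n_steps):
    """Points obtained by incrementing one variable at a time (assumed +/-)."""
    for i in range(len(params)):
        for k in range(1, n_steps + 1):
            for sign in (1.0, -1.0):
                cand = params.copy()
                cand[i] += sign * k * step
                cand[:-1] = np.clip(cand[:-1], 0.0, None)   # alpha_n >= 0
                total = cand[:-1].sum()
                if total == 0.0:
                    continue
                cand[:-1] /= total                           # sum of alphas = 1
                cand[-1] = np.clip(cand[-1], 0.0, 1.0)
                yield cand


def local_search(start, outputs, labels, W, step=0.05, n_steps=4):
    """Move to the best improving neighbour until none improves the cost."""
    current = np.array(start, dtype=float)
    current_cost = empirical_cost(current, outputs, labels, W)
    while True:
        best, best_cost = None, current_cost
        for cand in neighbours(current, step, n_steps):
            cost = empirical_cost(cand, outputs, labels, W)
            if cost < best_cost:
                best, best_cost = cand, cost
        if best is None:
            return current, current_cost
        current, current_cost = best, best_cost


def multi_start(outputs, labels, W, n_starts=10, seed=0):
    """Run the local search from random starting points and keep the best."""
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(n_starts):
        alphas = rng.dirichlet(np.ones(outputs.shape[0]))
        start = np.append(alphas, rng.uniform(0.0, 1.0))
        results.append(local_search(start, outputs, labels, W))
    return min(results, key=lambda result: result[1])
```

Because the empirical cost takes only finitely many values and strictly decreases at every move, the loop in `local_search` terminates. Repeating `multi_start` for several values of W would trace out an estimate of the E-R curve.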



4 Experimental Results 

Our experiments were aimed at comparing the error-reject trade-off achievable by a 
single classifier and by a linear combination of classifiers. The error-reject trade-off is 
represented here by the Accuracy-Reject (A-R) curve, which is equivalent to the E-R 
curve because minimising the error probability for any given reject probability is 
equivalent to maximising the accuracy. The accuracy is defined as 
$P(correct \mid accept) = P(correct)/[1 - P(reject)]$ and it has been estimated as the ratio between the 
correctly classified patterns and the accepted patterns. The experiments have been 
carried out on four different data sets. We used a data set of remote sensing images 
(the Feltwell data set [11]), two data sets from the ELENA database (Phoneme and 
Satimage) (ftp://ftp.dice.ucl.ac.be/pub/neural-nets/ELENA/databases), and one from 
the STATLOG database (Letter) (http://www.ncc.up.pt/liacc/ML/statlog/). Some 
details about such data sets are given in Table 1. 
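
A single point of an A-R curve can be estimated from a set of decisions as sketched below. This is an illustrative sketch only: the marker `REJECT` and the function name `accuracy_reject_point` are hypothetical, and the decisions are assumed to be class indices with rejected patterns marked by `REJECT`.

```python
import numpy as np

REJECT = -1  # hypothetical marker for a rejected pattern (not from the paper)


def accuracy_reject_point(decisions, labels):
    """Return (reject rate, accuracy), where accuracy is the ratio between the
    correctly classified patterns and the accepted patterns."""
    decisions = np.asarray(decisions)
    labels = np.asarray(labels)
    accepted = decisions != REJECT
    reject_rate = float(1.0 - accepted.mean())
    if not accepted.any():
        return reject_rate, float("nan")   # accuracy undefined if all rejected
    accuracy = float((decisions[accepted] == labels[accepted]).mean())
    return reject_rate, accuracy


# Five patterns: one rejected, three of the four accepted ones are correct.
reject_rate, accuracy = accuracy_reject_point([0, REJECT, 1, 1, 0], [0, 1, 1, 0, 0])
```

Here one pattern in five is rejected and three of the four accepted patterns are correct, so the point lies at a reject rate of 20% and an accuracy of 75%.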

Table 1. Data sets used for experiments 

Data Set            Training patterns   Test patterns   Features   Classes 
Feltwell                        5,820           5,124         15         5 
Phoneme (ELENA)                 2,702           2,702          5         2 
Satimage (ELENA)                3,213           3,216         36         7 
Letter (STATLOG)               15,000           5,000         16        26 

Four types of classifiers have been used: the linear and quadratic Bayes classifiers 
(indicated in the following with LB and QB), the multilayer perceptron neural 
network (MLP), and the k-nearest neighbours classifier (k-NN). Each data set has 
been subdivided into a training and a test set. After the training phase, the A-R curves 
of the single classifiers have been assessed on the test set. Then, the A-R curve of the 
linear combination of such classifiers has been computed by the algorithm described 
in the previous section. We have considered values of the cost parameter W that lead 
to reject rates between 0 and 20%, since this range is usually the most relevant for 
application purposes. Figs. 3-6 show the A-R curves of the single classifiers and of 
the linear combination (denoted with MCS) for the four data sets. For any value of the 
reject rate, the accuracy achieved by the MCS is always higher than the accuracy of 
the single classifiers. This means that the method proposed in Sect. 3 allows one to 
obtain an MCS with a better error-reject trade-off than that of each single classifier. It 
is worth noting that the linear combination allows a significant improvement of the 
classification accuracy when the performances of the single classifiers are similar 
(Feltwell and Satimage data sets). In contrast, the accuracy of the linear combination is close to 
that of the best single classifier if the performance of that classifier is significantly 
better than that of the others (Phoneme and Letter data sets). In this case, our 
algorithm assigns a value of the coefficient close to 1 to the best single classifier, that 
is, the linear combination tends to select the best single classifier. 



5 Conclusions 

In this paper, we studied the error-reject trade-off of linearly combined multiple 
classifiers in the framework of the minimum risk theory. We reported a theoretical 
analysis of the error-reject trade-off under the assumption of independence among the 
errors of the individual classifiers. We showed that, under the hypotheses made, a 
linear combination of classifiers can approximate the optimal error-reject trade-off. In 
addition, we proposed a method for computing the coefficients of the linear 
combination and the value of the reject threshold. The experimental results reported 
showed that our method allows designing a linear combination of classifiers that can 
effectively improve the error-reject trade-off of the individual classifiers. 

Fig. 4. Results for Satimage data set 

Fig. 5. Results for Letter data set 

Fig. 6. Results for Phoneme data set 

Abstract. Amidst the conflicting evidence of superiority of one over the 
other, we investigate the Sum and majority Vote combining rules for the 
two class case at a single point. We show analytically that, for Gaussian 
estimation error distributions, Sum always outperforms Vote, whereas 
for heavy tail distributions Vote may outperform Sum. 



1 Introduction 

Among the many combination rules suggested in the literature [?], 
Sum and Vote are used the most frequently. The Sum rule operates 
directly on the soft outputs of individual experts for each class hypothesis, 
normally delivered in terms of aposteriori class probabilities. The fused decision is 
obtained by applying the maximum value selector to the class dependent 
averages. When fusing by Sum, the experts' outputs can be treated equally or they 
can be assigned different weights based on their performance on a validation 
set. When independent experts are combined, equal weights appear to yield the 
best performance [?]. The properties of the rule have been widely investigated 
[?]. 



Vote, on the other hand, operates on class labels assigned to each pattern 
by the respective experts by hardening their soft decision outputs using the 
maximum value selector. The Vote rule output is a function of the votes received 
for each class in terms of these single expert class labels. Many versions of Vote 
exist, such as unanimous vote, threshold voting, weighted voting and simple 
majority voting [?]. In addition to these basic rules, the authors in [?] 
propose two voting methods claimed to outperform the majority voting. 

In our theoretical deliberations we focus on the basic Sum and Vote rules. 
Clearly both the weighted average (see e.g. [?]) and modified voting [?] can 
outperform the basic rules. However, the advanced strategies require training 
which is a negative aspect of these approaches. In any case, we believe that the 
conclusions drawn from the analysis of the simple cases will extend also to the 
more complex procedures. 

Many researchers [?] have found that Sum outperforms Vote, while a 
few [?] have demonstrated that Vote can equal or outperform Sum. The aim 
of this paper is to investigate the relationship between these two rules in more 
detail. We shall argue that the relative merits of Sum and Vote depend on 
the distribution of estimation errors. We show analytically that, for normally 
distributed estimation errors, Sum always outperforms Vote, whereas for heavy 
tail distributions Vote may outperform Sum. We then confirm our theoretical 
predictions by experiments on 2 class problems. 

The paper is organised as follows. In the next section we introduce the 
necessary formalism and develop the basic theory of classifier combination by 
averaging and majority voting. The relationship of the two strategies is discussed in 
Section 3. We draw the paper to conclusion in Section 4. 



2 Theoretical Analysis 



Consider a two class pattern recognition problem where pattern $Z$ is to be 
assigned to one of the two possible classes $\{\omega_1, \omega_2\}$. Let us assume that we have 
$N$ classifiers each representing the given pattern by an identical measurement 
vector $x$. In the measurement space each class $\omega_k$ is modelled by the probability 
density function $p(x|\omega_k)$ and the a priori probability of occurrence denoted by 
$P(\omega_k)$. We shall consider the models to be mutually exclusive which means that 
only one model can be associated with each pattern. 

Now according to the Bayesian decision theory, given measurements $x$, the 
pattern, $Z$, should be assigned to class $\omega_j$, i.e. its label $\theta$ should assume value 
$\theta = \omega_j$, provided the aposteriori probability of that interpretation is maximum, 
i.e. 

assign $\theta \rightarrow \omega_j$ if 

$$P(\theta = \omega_j | x) = \max_k P(\theta = \omega_k | x) \qquad (1)$$

In practice, the $j$-th expert will provide only an estimate $P_j(\omega_i|x)$ of the 
true aposteriori class probability $P(\omega_i|x)$ given pattern $x$, rather than the true 
probability. The idea of classifier combination is to obtain a better estimate of the 
aposteriori class probabilities by combining all the individual expert estimates 
and thus reducing the classification error. A typical estimator is the averaging 
estimator 

$$\tilde{P}(\omega_i|x) = \frac{1}{N} \sum_{j=1}^{N} P_j(\omega_i|x) \qquad (2)$$

where $\tilde{P}(\omega_i|x)$ is the combined estimate based on $N$ observations. 
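
As an illustration of the averaging estimator and of the hardening step used by Vote, the following minimal Python sketch fuses the outputs of a few experts at a single point. The function names and the numerical estimates are hypothetical, chosen only for illustration; ties in the vote are resolved in favour of the label seen first, so an odd number of experts is assumed.

```python
from collections import Counter
from typing import List, Sequence


def sum_rule(expert_posteriors: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-class posterior estimates of the N experts."""
    n_experts = len(expert_posteriors)
    n_classes = len(expert_posteriors[0])
    return [
        sum(expert[i] for expert in expert_posteriors) / n_experts
        for i in range(n_classes)
    ]


def decide(posteriors: Sequence[float]) -> int:
    """Maximum value selector: index of the class with the largest estimate."""
    return max(range(len(posteriors)), key=lambda i: posteriors[i])


def majority_vote(expert_posteriors: Sequence[Sequence[float]]) -> int:
    """Harden each expert output to a label, then return the most voted label."""
    labels = [decide(expert) for expert in expert_posteriors]
    return Counter(labels).most_common(1)[0][0]


# Three hypothetical experts estimating the two class posteriors at one point x.
estimates = [[0.60, 0.40], [0.45, 0.55], [0.70, 0.30]]
fused = sum_rule(estimates)
print("Sum decision:", decide(fused))
print("Vote decision:", majority_vote(estimates))
```

Class indices are zero-based here, so index 0 stands for the first class.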

Let us denote the error on the estimate of the $i$-th class aposteriori probability 
at point $x$ as $e_j(\omega_i|x)$ and let the probability distribution of the errors be 
$p_{ij}[e_j(\omega_i|x)]$. Then the probability distribution of the unscaled error $e_i(x)$ 

$$e_i(x) = \sum_{j=1}^{N} e_j(\omega_i|x) \qquad (3)$$

on the combined estimate will be given by the convolution of the component 
error densities, i.e. 

$$p(e_i(x)) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} p_{i1}(\lambda_1)\, p_{i2}(\lambda_2 - \lambda_1) \cdots p_{iN}(e_i(x) - \lambda_{N-1})\, d\lambda_1\, d\lambda_2 \cdots d\lambda_{N-1} \qquad (4)$$

The distribution of the scaled error $\epsilon_i(x) = \frac{1}{N} e_i(x)$ is then given by 

$$p(\epsilon_i(x)) = p\left(\tfrac{1}{N} e_i(x)\right) \qquad (5)$$

In order to investigate the effect of classifier combination, let us examine the 
two class aposteriori probabilities at a single point $x$. Suppose the aposteriori 
probability of class $\omega_s$ is maximum, i.e. $P(\omega_s|x) = \max_{i=1}^{2} P(\omega_i|x)$, giving the 
local Bayes error $e_B(x) = 1 - \max_{i=1}^{2} P(\omega_i|x)$. However, our classifiers only 
estimate these aposteriori class probabilities and the associated estimation errors 
may result in suboptimal decisions, and consequently in an additional 
classification error. In order to quantify this additional error we have to establish what 
the probability is for the recognition system to make a suboptimal decision. This 
situation will occur when the aposteriori class probability estimate for the other 
class becomes maximum. Let us derive the probability of the event occurring for 
a single expert $j$ for class $\omega_i$, $i \neq s$, i.e. when 

$$P_j(\omega_i|x) - P_j(\omega_s|x) > 0 \qquad (6)$$

Note that the left hand side of (6) can be expressed as 

$$P(\omega_i|x) - P(\omega_s|x) + e_j(\omega_i|x) - e_j(\omega_s|x) > 0 \qquad (7)$$

Equation (7) defines a constraint for the two estimation errors $e_j(\omega_k|x)$, $k = 1,2$, 
as 

$$e_j(\omega_i|x) - e_j(\omega_s|x) > P(\omega_s|x) - P(\omega_i|x) \qquad (8)$$

In a two class case the errors on the left hand side satisfy $e_j(\omega_s|x) = -e_j(\omega_i|x)$ 
and thus an additional labelling error will occur if 

$$2 e_j(\omega_i|x) > P(\omega_s|x) - P(\omega_i|x) \qquad (9)$$

The probability $e_A(x)$ of this event occurring will be given by the integral of 
the error distribution under the tail defined by the margin 
$\Delta P_{si}(x) = P(\omega_s|x) - P(\omega_i|x)$, i.e. 

$$e_A(x) = \int_{\Delta P_{si}(x)}^{\infty} p_{ij}[2 e_j(\omega_i|x)]\, d e_j(\omega_i|x) \qquad (10)$$

In contrast, after classifier fusion by averaging, the labelling error with respect 
to the Bayes decision rule will be given by 

$$e_S(x) = \int_{\Delta P_{si}(x)}^{\infty} p[2 \epsilon_i(x)]\, d\epsilon_i(x) \qquad (11)$$

Now how do these labelling errors translate to classification error probabilities? 
We know that for the Bayes minimum error decision rule the error probability 
at point $x$ will be $e_B(x)$. For the multiple classifier system which averages 
the expert outputs, the classification error probability is 

$$\beta(x) = e_B(x) + e_S(x)\,|\Delta P_{12}(x)| \qquad (12)$$

Thus for a multiple classifier system to achieve a better performance the labelling 
error after fusion, $e_S(x)$, should be smaller than the labelling error, $e_A(x)$, of a 
single expert. 

Let us now consider fusion by voting. In this strategy all single expert decisions 
are hardened and therefore each expert will make suboptimal decisions with 
probability $e_A(x)$. When combined by voting for the most representative class, 
the probability distribution of $k$ decisions, among a pool of $N$, being suboptimal 
is given by the binomial distribution. A switch of labels will occur whenever 
the majority of individual expert decisions is suboptimal. This will happen with 
probability 

$$e_V(x) = \sum_{k=\frac{N}{2}+1}^{N} \binom{N}{k}\, e_A^k(x)\,[1 - e_A(x)]^{N-k} \qquad (13)$$

Provided $e_A(x) < 0.5$ this probability will decrease with increasing $N$. 
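
The binomial tail above is straightforward to evaluate numerically. The sketch below is a minimal illustration, assuming an odd number of independent experts (so that no tie can occur) and a strict majority taken as floor(N/2) + 1; the function name and the value 0.3 for the single expert error are hypothetical.

```python
from math import comb


def vote_switching_error(e_a: float, n_experts: int) -> float:
    """Probability that a strict majority of n_experts independent experts,
    each suboptimal with probability e_a, is suboptimal at the same time."""
    k_min = n_experts // 2 + 1  # smallest number of votes forming a strict majority
    return sum(
        comb(n_experts, k) * e_a ** k * (1.0 - e_a) ** (n_experts - k)
        for k in range(k_min, n_experts + 1)
    )


# Hypothetical single expert error of 0.3, fused over odd pool sizes.
pool_sizes = (1, 3, 7, 11, 15)
errors = [vote_switching_error(0.3, n) for n in pool_sizes]
for n, e_v in zip(pool_sizes, errors):
    print(n, round(e_v, 4))
```

With a single expert error below 0.5 the printed values shrink as the pool grows, in line with the remark above.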

After fusion by Vote, the error probability of the multiple classifier will then 
be 

$$\gamma(x) = e_B(x) + e_V(x)\,|\Delta P_{12}(x)| \qquad (14)$$

Before discussing the relationship between Sum and Vote in the next section, 
let us pause and consider the formulae (12) and (14). The additional classification 
error, over and above the Bayesian error, is given by the second term in the 
expressions. Note that the term depends on the probability $e_X(x)$, $X \in \{S, V\}$, of the decision 
rule being suboptimal and the margin $\Delta P_{12}(x)$. The former is also a function of 
the margin, the number of experts $N$ and the estimation error distribution. Now, 
at the boundary $\Delta P_{12}(x) = 0$ and the multiple classifier system will be Bayes 
optimal, although at this point $e_X(x)$ is maximum. As we move away from the 
boundary $\Delta P_{12}(x)$ increases but at the same time $e_X(x)$ decreases. The product 
of the two nonnegative functions will be zero for $\Delta P_{12}(x) = 0$ and as $\Delta P_{12}(x)$ 
increases, it will reach a maximum, followed by a rapid decay to zero. The above 
behaviour is illustrated in figure 2 for Sum and Vote combination strategies for 
normally distributed estimation errors with different values of $\sigma(x)$ and $N$. We 
note that the additional error injected by the Sum rule is always lower than 
that due to Vote. As the standard deviation of the estimation error distribution 
increases, the probability of the decision rule being suboptimal increases for all 
margins. At the same time the peak of the two functions shifts towards the 
higher margin values. As the number of experts increases the above relationship 
between $\sigma(x)$ and $\Delta P_{12}(x)$ is preserved. However, the additional experts push 
the family of curves towards the origin of the graph. 

Fig. 1. Dirac delta error distribution 

Fig. 2. Sum and Vote switching error for normally distributed estimation errors with 
$\sigma(x)$ = .05, .15, .25 and .35, using 3, 7, 11 and 15 experts 



3 Relationship of Sum and Vote 



In this section we shall investigate the relationship between the Sum and Vote 
fusion strategies. Assume that errors $e_j(\omega_i|x)$ are unbiased, i.e. 
$E\{e_j(\omega_i|x)\} = E\{P_j(\omega_i|x) - P(\omega_i|x)\} = 0 \;\; \forall i, j, x$, and their standard deviations are the same 
for all the experts, i.e. $\sigma_j(\omega_i|x) = \sigma(x) \;\; \forall i, j$. Then, provided the errors $e_j(\omega_i|x)$ 
are independent, the variance of the error distribution for the combined estimate, 
$\tilde{\sigma}^2(x)$, will be 

$$\tilde{\sigma}^2(x) = \frac{\sigma^2(x)}{N} \qquad (15)$$



Let us also assume that the error distributions $p_{ij}[e_j(\omega_i|x)]$ are gaussian. For 
gaussian errors the distribution of the difference of the two errors with equal 
magnitude but opposite sign will also be gaussian with four times as large variance. 
The probability of the constraint (8) being satisfied is given by the area under 
the gaussian tail with a cut off point at $P(\omega_s|x) - P(\omega_i|x)$. More specifically, 
this probability, $e_A(x)$, is given by 

$$e_A(x) = 1 - \left[\frac{1}{2} + \mathrm{erf}\left(\frac{\Delta P_{si}(x)}{4\sigma}\right)\right] \qquad (16)$$

where $\mathrm{erf}(\cdot)$ is the error function defined as 

$$\mathrm{erf}\left(\frac{\Delta P_{si}(x)}{4\sigma}\right) = \frac{1}{2\sqrt{2\pi}\,\sigma} \int_{0}^{\Delta P_{si}(x)} \exp\left(-\frac{1}{2}\,\frac{\lambda^2}{4\sigma^2}\right) d\lambda \qquad (17)$$
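
Under the gaussian assumption the three switching errors can be evaluated in closed form, which gives a quick numerical check of the claim that Sum is never worse than Vote. The sketch below is an illustration only: it uses the standard normal distribution function built from Python's `math.erf` (the conventional error function, which is scaled differently from the integral used in the text), it assumes an odd number of experts, and all function names and parameter values are hypothetical.

```python
from math import comb, erf, sqrt


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def single_expert_error(margin: float, sigma: float) -> float:
    """P(2e > margin) for e ~ N(0, sigma^2), i.e. 2e ~ N(0, (2 sigma)^2)."""
    return 1.0 - normal_cdf(margin / (2.0 * sigma))


def sum_error(margin: float, sigma: float, n: int) -> float:
    """Same tail after averaging n independent errors (std sigma / sqrt(n))."""
    return single_expert_error(margin, sigma / sqrt(n))


def vote_error(margin: float, sigma: float, n: int) -> float:
    """Binomial tail: a strict majority of the n experts is suboptimal."""
    e_a = single_expert_error(margin, sigma)
    return sum(
        comb(n, k) * e_a ** k * (1.0 - e_a) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Hypothetical setting: margin 0.2, sigma 0.25, odd pool sizes.
for n in (3, 7, 11, 15):
    print(n, round(sum_error(0.2, 0.25, n), 4), round(vote_error(0.2, 0.25, n), 4))
```

For every pool size printed, the Sum value stays below the Vote value, and both fall as experts are added.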



In order to compare the performance gains of the Sum and Vote fusion under 
the gaussian assumption we have designed a simulation experiment involving $N$ 
experts, each estimating the same aposteriori probability $P(\omega_i|x)$, $i = 1,2$. 
Estimation errors are simulated by perturbing the target probability $P(\omega_i|x)$ with 
statistically independent errors drawn from a gaussian distribution with a zero 
mean and standard deviation $\sigma(x)$. We have chosen the aposteriori probability 
of class $\omega_1$ to be always greater than 0.5. The decision margin $\Delta P_{12}(x)$ is given 
by $2P(\omega_1|x) - 1$. The Bayesian decision rule assigns all the test patterns to class 
$\omega_1$. For each test sample the expert outputs are combined using the Sum rule 
and the resulting value compared against the decision threshold of 0.5. If the 
estimated aposteriori probability for a test sample from class $\omega_1$ is less than 0.5 
or if the value is greater than 0.5 for a sample from class $\omega_2$ an error counter 
is incremented. This particular method of estimating the probability of the 
decision rule being suboptimal, which we shall refer to as two class set testing, is 
dependent on the random process of sampling the aposteriori class probability 
distributions. In order to eliminate the inherent stochasticity of the sampling 
process and its impact on the estimated error we also ran the same experiment 
by testing with samples from a single class. The corresponding one class set 
testing method involved samples from class $\omega_1$ only and the switching error was 
estimated by counting the number of misclassified patterns. 

Similarly, the decision errors of the majority vote are estimated by converting 
the expert outputs into class labels using the pseudo Bayesian decision rule 
and then counting the support for each class among the N labels. The label of 
the winning class is then checked against the identity of the test pattern and 
any errors recorded. The results are averaged over 500 experiments for each 
combination of $P(\omega_i|x)$ and $\sigma(x)$, the parameters of the simulation experiment. 
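
A simulation in the spirit of the one class set testing described above can be sketched in a few lines of Python. This is not the authors' code: the function name, the number of trials, the seed and the parameter values are assumptions made for illustration, the perturbed outputs are not clipped to [0, 1], and an odd number of experts is assumed so that the vote cannot tie.

```python
import random
from typing import Tuple


def simulate(p1: float, sigma: float, n_experts: int,
             n_trials: int = 20000, seed: int = 0) -> Tuple[float, float]:
    """Estimate the Sum and Vote switching errors for samples of class w_1.

    p1 is the true aposteriori probability of class w_1 (assumed > 0.5);
    every expert outputs p1 plus independent zero-mean gaussian noise.
    """
    rng = random.Random(seed)
    sum_errors = 0
    vote_errors = 0
    for _ in range(n_trials):
        outputs = [p1 + rng.gauss(0.0, sigma) for _ in range(n_experts)]
        # Sum rule: average the soft outputs, compare with the 0.5 threshold.
        if sum(outputs) / n_experts < 0.5:
            sum_errors += 1
        # Vote rule: harden each output first, then count the labels.
        votes_for_w1 = sum(1 for o in outputs if o > 0.5)
        if votes_for_w1 <= n_experts // 2:
            vote_errors += 1
    return sum_errors / n_trials, vote_errors / n_trials


# Hypothetical setting: P(w_1|x) = 0.6 (margin 0.2), sigma = 0.25, 7 experts.
sum_err, vote_err = simulate(0.6, 0.25, 7)
print("Sum switching error:", sum_err)
print("Vote switching error:", vote_err)
```

Both rules are evaluated on the same noisy outputs in each trial, which reduces the variance of their difference.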

The empirical results showing the additional error incurred are plotted as 
a function of the number of experts $N$ in Figure 3. The results were obtained 
using the two class set testing approach. The theoretical values predicted by 
formulas (11) and (13) are also plotted for comparison. The experimental results 
mirror closely the theoretically predicted behaviour. All the results shown in 
Figure 3 indicate that Sum outperforms majority Vote at all error levels and all 
margins $\Delta P_{12}(x)$ except for the boundary where no improvement is possible. For 
a large number of experts Vote approaches the performance of Sum. However, 
for high values of $\sigma(x)$ the initial discrepancy in performance between Sum and 
Vote is large and the convergence of the two strategies as the number of experts 
increases is slow. The slight positive bias of the empirical errors as compared with 
their theoretical predictions is believed to be due to sampling effects. As $\sigma(x)$ 
increases the additional classification error also increases. In contrast, increasing 
the margin has the opposite effect. 

Fig. 3. Comparison of experimental Sum and Vote switching errors with theoretical 
predictions, for normal noise at $\sigma(x)$ = 0.25, using up to 100 experts. 

While under the Gaussian assumption the Sum rule always outperforms Vote 
it is pertinent to ask whether this relationship holds for other distributions. 
Intuitively, if the error distribution has heavy tails it is easy to see that fusion 
by Sum will not result in improvement until the probability mass in the tail 
of $p_{ij}[e_j(\omega_i|x)]$ moves within the margin $\Delta P_{12}(x)$. In order to gain better 
understanding of the situation let us consider a specific example with the error 
distribution $p_{ij}[e_j(\omega_i|x)]$ being defined as a mixture of three Dirac delta 
functions with the weights and positions shown in figure 1. Using the convolution 
integral in equation (4) and substituting into (11) we can derive the probability, 
$e_S(x)$, of the decision rule being suboptimal for a given margin $\Delta P_{12}(x)$. Figure 
4(a) shows this probability as a function of the number of expert outputs fused. 
The function has been computed for a range of margins from $\Delta P_{12}(x) = 0.04$ 
to $\Delta P_{12}(x) = 0.2$. The figure shows clearly an oscillating behaviour of $e_S(x)$. It 
is interesting to note that for small margins, initially (i.e. for a small number of 
experts) the error probability of the sum combiner has a tendency to grow above 
the probability of the decision rule being suboptimal for a single expert. First 
the performance improves when $N = 2$ but as further experts are added the 
error builds up as the probability mass shifts from the origin to the periphery 
by the process of convolution. It is also interesting to note that for $N = 2$ Vote 
degrades in performance. However, this is only an artifact of a vote tie not 
being randomised in the theoretical formula. Once the first line of the probability 
distribution of the sum of estimation errors falls below the threshold defined 
by the margin between the two class aposteriori probabilities the performance 
dramatically improves. However, by adding further experts the error build up 
will start all over again, though it will culminate at a lower value than at the 
previous peak. We can see that for instance for $\Delta P_{12}(x) = 0.04$ the benefits from 
fusion by the sum rule will be very poor and there may be a wide range of $N$ for 
which fusion would result in performance deterioration. Once the margin reaches 
0.16 Sum will generally outperform Vote but there may be specific numbers of 
experts for which Vote is better than Sum. In both figures 4(a) and 4(b) the 
position of the Dirac delta components of the error distribution offset from the 
origin is at $-[1 - P(\omega_1|x)]$. Figure 4(b) shows the additional effect of sampling 
the aposteriori class probability distribution inherent in the two class set testing 
approach. 

Fig. 4. Sum and Vote switching error: A comparison of (a) single class and (b) two 
class experimental results and theoretical predictions for delta noise positioned at (1-p), 
using up to 20 experts 

In contrast the corresponding probability, $e_V(x)$, given for the majority vote 
by formula (13), diminishes monotonically (also in an oscillating fashion) with 
the increasing number of experts. Thus there are situations where Vote 
outperforms Sum. Most importantly, this is likely to happen close to the decision 
boundary where the margins are small. 
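
The heavy tail effect can be reproduced exactly for a simple discrete error model. The sketch below assumes, purely for illustration, a symmetric three point distribution with error values -a, 0 and +a and probabilities w, 1 - 2w and w; these weights and positions are hypothetical and are not the ones used in the figures. The sample is taken to come from class $\omega_1$, so a switch occurs when the fused estimate drops below 0.5, i.e. when the mean error falls below minus half the margin. An odd number of experts is assumed for the vote.

```python
from math import comb


def sum_error_delta(margin: float, a: float, w: float, n: int) -> float:
    """Exact probability that the mean of n errors is below -margin / 2,
    each error being -a, 0 or +a with probabilities w, 1 - 2w, w."""
    total = 0.0
    for i in range(n + 1):              # experts with error +a
        for j in range(n - i + 1):      # experts with error -a
            if (i - j) * a / n < -margin / 2.0:
                ways = comb(n, i) * comb(n - i, j)  # multinomial coefficient
                total += ways * w ** (i + j) * (1.0 - 2.0 * w) ** (n - i - j)
    return total


def vote_error_delta(margin: float, a: float, w: float, n: int) -> float:
    """Binomial tail: a single expert switches only when its error is -a."""
    e_a = w if a > margin / 2.0 else 0.0
    return sum(
        comb(n, k) * e_a ** k * (1.0 - e_a) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Hypothetical parameters: small margin 0.04, a = 0.8, w = 0.1.
for n in (1, 3, 5, 7, 9):
    print(n,
          round(sum_error_delta(0.04, 0.8, 0.1, n), 4),
          round(vote_error_delta(0.04, 0.8, 0.1, n), 4))
```

With these assumed parameters the Sum error grows with the first few experts while the Vote error falls, which is the small margin behaviour discussed above.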

By the central limit theorem, as the number of experts increases, the 
probability distribution of the sum of expert outputs will become more and more 
gaussian. At the same time the variance of the labelling error distribution will 
decay with a factor $1/N$. Thus at some point the result of fusing $N$ expert 
outputs subject to error distribution in Figure 1 will be indistinguishable from the 
effect of fusing estimates corrupted by normally distributed noise with the same 
initial variance. For our distribution in Figure 1 the standard deviation equals 
$\sigma(x) = 0.357$. From the experiments presented in Section 3 we already 
established that for this regime Sum should be better than Vote. As the effective $\sigma(x)$ 
is quite high it should take relatively long time for the two fusion strategies to 
converge which is borne out by the plots in Figure 4(a). 

In summary, for error distributions with heavy tails we can expect Vote to 
outperform Sum for small margins. At some point Sum will overtake Vote and 
build up a significant margin between the two which will eventually diminish as 
Vote converges to Sum from above. 



4 Conclusion 

The relationship of the Sum and Vote classifier combination rules was 
investigated. The main advantage of these rules is their simplicity and their 
applicability without the need for training the classifier fusion stage. We showed 
analytically that, for normally distributed estimation errors, Sum always 
outperforms Vote, whereas for heavy tail distributions Vote may outperform Sum. 
We then confirmed our theoretical predictions by experiments on synthetic data. 
We showed for Gaussian error distributions that, as expected, Sum outperforms 
Vote. The differences in performance are particularly significant at high 
estimation noise levels. However, for heavy tail distributions the superiority of Sum 
may be eroded for any number of experts if the margin between the two 
aposteriori class probabilities is small or for a small number of cooperating experts 
even when the margin is large. In the latter case, once the number of experts 
exceeds a certain threshold, Sum tends to be superior to Vote. 

Abstract. We report the results from an experimental investigation 
on the complexity of data subsets generated by the Random Subspace 
method. The main aim of this study is to analyse the variability of the 
complexity among the generated subsets. Four measures of complexity 
have been used, three from [?]: the minimal spanning tree (MST), the 
adherence subsets measure (ADH), the maximal feature efficiency (MFE); 
and a cluster label consistency measure (CLC) proposed in [7]. Our 
results with the UCI “wine” data set relate the variability in data 
complexity to the number of features used and the presence of redundant 
features. 



1 Introduction 

Recently, Ho described three measures of complexity of classification tasks and 
related them to the comparative advantages of two methods for creating multiple 
classifiers, namely, the Bootstrap method and the Random Subspace method 
[?]. Here we report the results from a pilot experiment on the complexity of 
data subsets generated by the Random Subspace method. The main aim was 
to analyse the variability of the complexity among the generated subsets. The 
rationale behind this objective is the assumption that multiple classifier systems 
achieve best results when the individual classifiers are of similar accuracy. The 
intuition for this statement is that if the individual classifiers are very different 
in accuracy (they must be diverse in other ways for best performance!), then 
there will be: (a) at least one classifier which is much better than the rest of the 
team, and thus using the whole team will hardly improve on the best individual, 
or (b) at least one classifier which is much worse than the rest of the team, and 
using it in a combination will only degrade the overall performance. In other 
words, it is reasonable to expect to gain from using a team of classifiers when 
the classifiers are of approximately the same accuracy even if this accuracy is 
not too high [?]. 

Another intuitive assumption is that data complexity is straightforwardly 
related to classification accuracy. Therefore, if the generating method produces 
subsets of similar complexity, we can expect that classifiers of similar accuracies 
can be built upon them. However, this does not mean that the classifiers will 
possess the necessary diversity to form a good team. 

Note that the individual accuracy and team diversity are different concepts. 
The members of the team might have the same accuracy and be identical or be 
as diverse as the accuracy allows for. For example, let $D_1$ and $D_2$ be classifiers 
of equal accuracies, run on 100 objects. Assume that each classifier recognizes 
98 of the 100 objects. The classifiers might be identical (failing on the same 
2 objects), “semi-diverse” (failing simultaneously on 1 object and separately 
on 1 object each) or diverse (each one failing on a different couple of objects). 
The individual accuracy is clearly related to the complexity of the problem but 
diversity is not. Depending on how diversity is defined, it may be bounded from 
above, and the bound will depend on the magnitude of the accuracy (c.f. [?]). 
On the other hand, the accuracy of the team is related to the diversity among 
the team members. Thus, we cannot expect a clear-cut relationship between 
the accuracy of the team and the complexity of the data sets on which the 
individual classifiers are designed. Therefore we confine the study to finding out 
about the variability of complexity of the data sets and do not attempt to relate 
this variability to the team accuracy. An analysis on this matter is out of the 
scope of this paper. 

One approach to enhancing diversity of the individual classifiers is to train 
them on different subsets of the available labeled data set. Let $Z = \{z_1, \ldots, z_N\}$ 
be a labeled data set with $N$ elements, $z_j \in \Re^n$, $j = 1, \ldots, N$. According to 
the random subspace method the individual classifiers are based on different 
subsets of features, i.e., on different subspaces of the feature space $\Re^n$. Ho [3] 
shows that random sampling (without repetition) to get a set of $d < n$ features 
from the integers from 1 to $n$, is a viable line for building multiple classifier 
systems. 

We applied four measures of complexity: the minimal spanning tree (MST), 
the adherence subsethood (ADH) based on the $\epsilon$-neighborhood measure, the 
maximum feature efficiency (MFE), all three from [?], and a measure which we 
call the Cluster Label Consistency (CLC), introduced in our previous study [7]. 
In this study we bring in some results from [7] and continue with an additional 
study on the variability in complexity of the data sets generated by the Random 
Subspace method. 



2 Measures of Complexity 

2.1 Minimal Spanning Tree 

Given the data set $Z$ and a metric on $\Re^n$, a minimal spanning tree can be 
constructed which connects all the sample points regardless of their class labels. 
Here we use the Euclidean distance as the metric throughout this study. Some 
edges of the MST will connect points from different classes and the count of 
such edges gives us a measure of the length of the boundary between the classes. 
Since there are $N - 1$ edges for $N$ sample points, the count can be expressed as 
a fraction of $N - 1$, leading to a complexity measure 

$$MSTcomplexity = \frac{N_d}{N - 1}, \qquad (1)$$

where $N_d$ is the number of edges in the minimal spanning tree connecting 
different classes. 
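
The MST measure can be sketched in plain Python as follows. The sketch builds the tree with Prim's algorithm on the complete Euclidean graph, which is one standard way of obtaining a minimal spanning tree and is not necessarily the procedure used in the study; the function names and the toy data are hypothetical.

```python
from math import dist
from typing import List, Sequence, Tuple


def mst_edges(points: Sequence[Sequence[float]]) -> List[Tuple[int, int]]:
    """Prim's algorithm on the complete Euclidean graph; returns N - 1 edges."""
    n = len(points)
    in_tree = [False] * n
    best_dist = [float("inf")] * n   # cheapest known link into the tree
    best_from = [-1] * n             # tree vertex providing that link
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges


def mst_complexity(points: Sequence[Sequence[float]],
                   labels: Sequence[int]) -> float:
    """Fraction of MST edges whose end points carry different class labels."""
    edges = mst_edges(points)
    mixed = sum(1 for u, v in edges if labels[u] != labels[v])
    return mixed / (len(points) - 1)


# Toy data: two well separated groups on a line.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
labels = [1, 1, 1, 2, 2]
print(mst_complexity(points, labels))
```

In the toy data only the edge bridging the two groups connects different classes, so one of the four tree edges is counted.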

2.2 Adherence Subsets 

This method proposed by Ho [?] considers the clustering properties of the data. 
It is based on a reflexive and symmetric (tolerance) binary relation $\mathcal{R}$ between 
two points $x, y$ in a set $F$. $\mathcal{R}$ is defined by $x \mathcal{R} y \Leftrightarrow d(x, y) < \epsilon$, where $d(x, y)$ is a 
given metric and $\epsilon$ is a given non-zero constant. We define $\Gamma(x) = \{y \in F \mid y \mathcal{R} x\}$ 
to be the $\epsilon$-neighbourhood of $x$. An adherence mapping, $ad$, from the power set 
$\mathcal{P}(F)$ to $\mathcal{P}(F)$ is such that: 

$$\begin{cases} ad(\emptyset) = \emptyset \\ ad(\{x\}) = \Gamma(x) \\ ad(A) = \bigcup_{x \in A} ad(\{x\}) \quad \forall A \subseteq F. \end{cases}$$

The largest possible adherence subsets can be grown for each point by 
successively expanding the adherence subset at each stage whilst ensuring that 
all newly included points come from the same class. For example, $ad^0(\{x\}) = \{x\}$, 
$ad^1(\{x\}) = ad(\{x\})$, $ad^2(\{x\}) = ad(ad(\{x\}))$, ... gives us progressively 
higher order adherence subsets. For each point only the highest order subset is 
retained such that all elements are from the same class. This procedure defines 
a partition of the data set where each cluster contains data points with the 
same class label. The number of such clusters is an indication of the complexity 
of the problem. If the classes are compact and far from each other, then each 
class will ideally form a single separate adherence subset. When the classes are 
overlapping, multiple clusters are likely to appear. 

The calculation works by taking a labelled data set $Z$ of size $N$ and for each 
point growing the largest possible adherence subset such that all elements of the 
subset are from the same class. The complexity is then given by: 

$$ADHcomplexity = \frac{W}{N} \qquad (2)$$

where $W$ is the number of different adherence subsets. 

In Ho’s paper [?] the choice of $\epsilon$ was $\epsilon = 0.55\delta$ where $\delta$ was the minimal 
distance between two points of different classes. In a preliminary experiment we 
studied the effect of $\epsilon$ on the complexity value and found that the relationship 
between $\epsilon$ and ADHcomplexity is not monotonic. Indeed, if $\epsilon$ is too small, then 
each point will be a cluster on its own, and ADHcomplexity = 1. On the other 
hand, if $\epsilon$ is too large, then the $\epsilon$-neighbourhood of $x$ will contain point(s) from 
a different class. Again, $x$ will be marked as a cluster on its own, leading to 
ADHcomplexity = 1. Since there seems to be no clear reason for choosing a 
particular $\epsilon$, we picked $\epsilon = \min + 0.1\,(\max - \min)$, where min and max were 
the minimum and maximum distances in the data set regardless of class labels. 
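
The growing procedure and the choice of the neighbourhood radius can be sketched as follows. This is one possible reading of the description above, not the authors' implementation: each point is expanded order by order until the next expansion would either add nothing or pull in a point of another class, the complexity is taken as the number of distinct subsets divided by the number of points, and the function names and toy data are hypothetical.

```python
from math import dist
from typing import Sequence


def default_eps(points: Sequence[Sequence[float]]) -> float:
    """Radius min + 0.1 * (max - min) over all pairwise distances."""
    d = [dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]
    return min(d) + 0.1 * (max(d) - min(d))


def adh_complexity(points: Sequence[Sequence[float]],
                   labels: Sequence[int], eps: float) -> float:
    """Number of distinct class-pure adherence subsets divided by N."""
    n = len(points)
    # eps-neighbourhood of every point (each point is its own neighbour).
    neighbours = [
        frozenset(j for j in range(n) if dist(points[i], points[j]) < eps)
        for i in range(n)
    ]
    subsets = set()
    for i in range(n):
        current = frozenset([i])
        while True:
            grown = frozenset().union(*(neighbours[j] for j in current))
            pure = all(labels[j] == labels[i] for j in grown)
            if not pure or grown == current:
                break  # keep the highest order subset that is still pure
            current = grown
        subsets.add(current)
    return len(subsets) / n


# Toy one-dimensional data: two compact, well separated classes.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]
labels = [1, 1, 1, 2, 2]
for eps in (0.5, 1.5, 20.0):
    print(eps, adh_complexity(points, labels, eps))
print(default_eps(points))
```

On the toy data a very small and a very large radius both leave every point as its own cluster, whereas an intermediate radius recovers one subset per class, which illustrates the non-monotonic dependence on the radius.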

2.3 Maximum Feature Efficiency 

This method is suitable for 2 classes only. The complexity on each feature is 
assessed separately. All points are projected on that feature axis and the overlap 
interval is found. The MFE complexity for the $i$th feature is 

$$MFEcomplexity_i = \frac{N_i}{N} \qquad (3)$$

where $N_i$ is the number of points within the overlap interval. The final complexity 
value is defined as 

$$MFEcomplexity = \min_i MFEcomplexity_i. \qquad (4)$$

2.4 Cluster Label Consistency 

This measure estimates how well the classes match the possible clusters in data. 
First $c$ clusters are obtained on the whole data set regardless of the class labels 
and then the labels are used to count the number from each class within each 
cluster. “Pure” clusters will give low complexity values whereas “contaminated” 
clusters will give high complexity values. The complexity measure is 

(5) 

where $C_i$ is the cluster label consistency of cluster $i$, found as the fraction of the 
maximal number of points of the same class label in the cluster. In case of a 
perfect match, i.e., when each class is a cluster on its own, the complexity is 0. 

Consider as an example a data set distributed according to a mixture of 
5 Gaussians in $\Re^2$ with centers (0, 0), (2, 3), (0, 4), (3, 1) and (2, 4), respectively, 
and variance 0.4 along each axis. The left plot in Figure 1 shows the clusters and 
their centroids. A circular decision boundary is applied on the data set centered 
at (1, 2) with radius 2. All points inside the circle are labeled in class $\omega_1$, and 
the points outside the circle are labeled in $\omega_2$, as illustrated on the right plot in 
Figure 1. Thus, the Bayes error for this data model is zero. The peculiar feature 
about this data set is that the cluster structure of the data is not representative 
for the true class structure. 

Fig. 1. A 5-clusters example of a problem where perfect separation is possible but the 
complexity is high because the cluster structure of the data is not representative for 
the class label structure. 

The quadratic discriminant classifier gave a 17 % training error on this data 
set. The following complexity values have been obtained: 

MSTcomplexity   0.1200      CLCcomplexity (c = 2)   0.4782 
ADHcomplexity   1.0000      CLCcomplexity (c = 3)   0.4789 
MFEcomplexity   0.6800      CLCcomplexity (c = 4)   0.4408 
                            CLCcomplexity (c = 6)   0.4632 

The purpose of showing this example was to highlight two seemingly discouraging
observations. First, the achievable accuracy (100 % in this case) is not
necessarily related to the measures of complexity. Second, the results give us
an early indication of the severe disagreement between the measures of complexity,
despite the fact that they are all meant to measure the same characteristic
of the data set. However, we note that (1) in real problems, the class-cluster
relationship may be less deceptive than in this example, and (2) the difference
in the values of the complexity measures shows that the notion of “complexity”
needs a stricter definition beyond the common intuition.

3 Experiments 

3.1 Data 

We used the “wine” data set from the UCI Repository of Machine Learning
Databases¹. It contains 178 cases labeled in 3 classes, with 13 continuous-valued
features and no missing values. From this data set we derived the following 7
problems:

Case A: 1 vs. 2 vs. 3.
Case B: 1 vs. (2 and 3).
Case C: 2 vs. (1 and 3).
Case D: 3 vs. (1 and 2).
Case E: 1 vs. 2.
Case F: 1 vs. 3.
Case G: 2 vs. 3.

¹ Found at http://www.ics.uci.edu/~mlearn/MLRepository.html

Applying the Random Subspace method, we formed 50 subsets by randomly
choosing 5 of the n = 13 features. The four complexity measures were calculated
for each data set.
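
The sampling step can be sketched as follows. This is an illustrative sketch only: the function names, the seed and the toy rows are hypothetical, and features are assumed to be drawn without replacement within each subset.

```python
import random


def random_subspaces(n_features=13, subset_size=5, n_subsets=50, seed=0):
    """Return n_subsets sorted lists of feature indices, each drawn without replacement."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_features), subset_size))
            for _ in range(n_subsets)]


def project(rows, feature_idx):
    """Keep only the chosen feature columns of every row."""
    return [[row[j] for j in feature_idx] for row in rows]


subspaces = random_subspaces()

# Toy stand-in for the 178 x 13 wine matrix (values are placeholders).
toy_rows = [[float(i * 13 + j) for j in range(13)] for i in range(3)]
toy_subset = project(toy_rows, subspaces[0])
```

Each projected data set keeps all cases and their labels; only the feature columns change, so the complexity measures can be evaluated on every one of the 50 subsets.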



Table 1. Complexity calculated by MST, ADH, MFE and CLC (in %) with the
Random Subspace method for the 7 cases

          MST           ADH           MFE           CLC
Case   mean   std    mean   std    mean   std    mean   std
A       21     7     100     0      -      -      35    11
B       12     5      99     4     44     14      23    11
C       18     5     100     3     57     15      30     6
D       13     8      95    17     34     12      24     4
E       13     5      99     7     39     17      19    10
F        9     5      98    14      8     17      21    14
G       18    10      97    10     39     18      28    12



3.2 Results 

Table 1 shows the means and the standard deviations of the 4 measures
for the 7 cases. As in the example at the end of the previous section, the
measures give very different values. Knowing that three of them (all except
CLC) span approximately the same intervals (0.01 < MST complexity < 0.99,
0.02 < ADH complexity < 1, and 0 < MFE complexity < 1), the differences in
the complexity values are puzzling.

In [7] we carried out similar experiments for the Bootstrapping and the Data
splitting methods too. To compare visually the variability of the Random Subspace
method with the other two, we display in Figure 2 the means for the 21
experiments (3 methods × 7 cases) and the minima and maxima as the error
bars. (For the MFE, there are 18 experiments (3 methods × 6 cases) because it
works for two classes only, and case A is excluded.)

In all 4 subplots, the first 7 (6 for MFE) bars are for the Bootstrapping
method, the next 7 (6) are for the Data splitting method, and the last 7 (6)
correspond to the Random Subspace method, in the same order of the cases
(A to G).

A common finding of all complexity measures is that the Random Subspace
method for creating data subsets offers the highest variability of the complexity
of the obtained sets. However, this seems to be the only finding on which the four
complexity measures agree. For example, while the ADH measure designates
the Random Subspace method as producing the least complex data (Figure 2),
the MFE measure classes these data sets as the hardest.

The Bootstrap method has the lowest standard deviations (on all four measures),
indicating that the data sets obtained exhibit complexity of a similar value.

[Figure 2: four panels titled MST complexity, ADH complexity, MFE complexity and
CLC complexity.]

Fig. 2. The mean and limits for the 4 complexities. Bars 1-7: Bootstrapping, 8-14:
Data Splitting, 15-21: Random Subspace



These two findings can be explained by the following:

1. The Bootstrap method creates data subsets by small variations of the
original data set. Consequently, the variations in complexity among such data
subsets can be expected to be small. In fact, as pointed out by Breiman,
unstable classifiers are necessary to exploit effectively the low diversity of the
data subsets generated by the Bootstrap method. Neural networks are examples
of such unstable classifiers, and, curiously, we rely on their ability to overtrain.

2. In contrast, the Random Subspace method usually generates data subsets
exhibiting very different complexities because projecting the data set on a subspace
may lead to a very different arrangement of the classes. However,
it should be noted that such variability in complexity strongly depends on the
number of features used. It is easy to argue that variability in complexity among
subsets should decrease as the number of randomly picked features increases.
In addition, we think that complexity variations also depend on the degree of
redundancy among the features. For example, if we have picked a subset of features
containing a redundant feature $x_i$, and another subset containing a redundant
feature $x_j$ in its place, the difference in complexity between the two data sets
will not be great. Ho conjectures that multiple classifier systems based on
the Random Subspace method are expected to perform well exactly when there
is a certain redundancy in the feature set. This supports our intuitive
hypothesis that data sets of similar complexity (hence similar individual
accuracy) are a better basis for a classifier combination system.

3.3 Additional Experiments on Random Subspace Method 

In order to investigate how the complexity variations of the Random Subspace
method depend on the number of random features used and on the degree of
redundancy among the features, two experiments have been carried out:

1. We created subsets by randomly choosing 5, 7, 9 and 11 of the 13 features of the
wine data set. For each number of features, 50 subsets were created and the mean
and the standard deviations were computed. We performed the experiments for
all seven problems A-G.

2. We created a “redundant” version of the wine data set by adding 13 redundant
features to the original feature space. The new 13 features were created
by adding the first feature to every feature (itself included), i.e., the new feature
set consists of $\{x_1, \ldots, x_{13}, 2x_1, x_2 + x_1, \ldots, x_{13} + x_1\}$. For
each of the seven problems A-G, 50 subsets were generated by randomly choosing 5 of
the 13 features and the mean and the standard deviations were computed.
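
The construction of the redundant feature space in the second experiment amounts to one line per sample. The sketch below illustrates it under the reading given above (every new feature is an original feature plus the first one); the function name and the toy row are hypothetical.

```python
def add_redundant_features(rows):
    """Append x_j + x_1 for every feature j, so the first new feature is 2 * x_1."""
    return [list(row) + [x + row[0] for x in row] for row in rows]


# Hypothetical 13-feature sample standing in for one wine case.
row = [float(j) for j in range(1, 14)]
augmented = add_redundant_features([row])[0]
```

Every new column is a linear combination of two original columns, so the augmented space carries no new information; it only raises the chance that two random subsets span similar subspaces.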

Figure 3 shows the behaviour of the four complexity measures as a function
of the number of random features, for cases A to G. The means and the
standard deviations are displayed. As hypothesized, the behaviour of the
standard deviation shows that complexity variations among subsets decrease as
the number of features increases, whereas the mean tends to level off.

Table 2 shows the standard deviations obtained by sampling from the original
feature space and the augmented feature space for the seven cases and the 4
complexity measures. The standard deviations for the augmented feature space
are lower than the ones for the original feature space in most of the cases: for
all 7 cases with MST; for 3 cases with ADH; for 5 cases with CLC; and for 2 of the
6 possible cases with MFE. This points out that complexity variation among
the generated data sets depends on the degree of redundancy as anticipated: the
higher the redundancy, the smaller the variability.



Table 2. Standard deviation (%) of the complexities for the Random Subspace method
on the original (orig) and redundant (redn) feature spaces (the latter augmented
with redundant features).

          MST             ADH             CLC             MFE
Case   orig    redn    orig    redn    orig    redn    orig    redn
A      8.04    4.35    5.45   12.05   11.37    5.68     -       -
B      5.68    2.12    6.88    9.21   10.69    6.85   15.86   13.67
C      6.30    3.48    3.73    8.95    5.81    8.39   17.24   18.61
D      8.01    3.73   14.77   11.40    3.68    4.64   13.17   17.35
E      4.90    3.01    9.44   10.94   10.63    3.42   14.94   14.43
F      6.12    4.50   25.52   21.89   12.19    9.85   15.53   27.80
G     10.86    6.41   16.17   14.42   11.55    8.06   14.06   18.95

Fig. 3. Plots of the means (left) and standard deviations (right, shaded) of the 4
complexity measures versus the number of features selected at random. 50 experiments
have been carried out for each of the cases A-G for each of 5, 7, 9, and 11 features.
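
The per-measure counts quoted for Table 2 can be checked directly from the tabulated values. The snippet below is a small verification sketch; the dictionary layout and variable names are ours, and the numbers are transcribed from Table 2.

```python
# (orig, redn) standard deviations in %, transcribed from Table 2.
# MST, ADH and CLC list cases A..G; MFE lists cases B..G (case A is not defined).
TABLE2 = {
    "MST": [(8.04, 4.35), (5.68, 2.12), (6.30, 3.48), (8.01, 3.73),
            (4.90, 3.01), (6.12, 4.50), (10.86, 6.41)],
    "ADH": [(5.45, 12.05), (6.88, 9.21), (3.73, 8.95), (14.77, 11.40),
            (9.44, 10.94), (25.52, 21.89), (16.17, 14.42)],
    "CLC": [(11.37, 5.68), (10.69, 6.85), (5.81, 8.39), (3.68, 4.64),
            (10.63, 3.42), (12.19, 9.85), (11.55, 8.06)],
    "MFE": [(15.86, 13.67), (17.24, 18.61), (13.17, 17.35),
            (14.94, 14.43), (15.53, 27.80), (14.06, 18.95)],
}

# Number of cases in which the redundant space gives the lower standard deviation.
lower = {m: sum(redn < orig for orig, redn in pairs) for m, pairs in TABLE2.items()}
total_lower = sum(lower.values())
total_cases = sum(len(pairs) for pairs in TABLE2.values())
```

The counts come out as 7 (MST), 3 (ADH), 5 (CLC) and 2 (MFE), i.e. 17 of the 27 comparisons, which matches the statement that the reduction holds in most, but not all, of the cases.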



4 Conclusions 



In this paper, we considered the Random Subspace method for generating data
sets for multiple classifier systems. We used 4 measures of complexity: the minimal
spanning tree (MST), the adherence subsets measure (ADH), the maximal
feature efficiency (MFE), and a cluster label consistency measure (CLC). Our
results with the UCI “wine” data set led us to the following conclusions:

1. The Random Subspace method usually generates data subsets exhibiting
different complexities. The variability in complexity is higher than that for
bootstrapping and data splitting (Figure 2). All 4 measures are capable of detecting
this variability, although ADH fails to distinguish between the complexity of the
data sets for 7, 9 and 11 randomly selected features.

2. The variability in complexity of the data sets generated by the Random
Subspace method is related to the number of features being selected. Our experiment
showed that complexity variations among subsets decrease as the number
of features increases, whereas the mean complexity tends to level off (Figure 3).
This agrees with our hypothesis, based on the idea that the more features we
use, the greater the chance of getting data sets on highly overlapping subspaces
and hence the more similar the complexity.

3. Redundancy in the feature set generally leads to generating sets of
more similar complexity compared to sets obtained from a feature set with little
or no redundancy. While MST and CLC support this intuition (100 % of the cases
for MST and about 70 % for CLC), ADH and MFE produce dubious results. According
to the latter two measures, there is no clear pattern of reduction of the
variability of the complexity value when redundant features are used.

Since there is no consensus on a single definition of complexity, we agree
with [4] that at this point we can use a (probably restricted) set of measures as
a “complexity vector”. This vector can be further used to select an appropriate
classifier model for a certain data set or to indicate whether a collection of subsets
is a suitable basis for a multiple classifier system.



On Combining Dissimilarity Representations

Abstract. For learning purposes, representations of real world objects
can be built by using the concept of dissimilarity (distance). In such a
case, an object is characterized in a relative way, i.e. by its dissimilarities
to a set of selected prototypes. Such dissimilarity representations are
found to be more practical for some pattern recognition problems.
When experts cannot decide on a single dissimilarity measure, a number
of them may be studied in parallel. We investigate two possibilities:
combining either the dissimilarity representations themselves or the classifiers
built on each of them separately. Our experiments conducted on a handwritten
digit set demonstrate that when the dissimilarity representations
are of a different nature, a much better performance can be obtained by
their combination than on individual representations.



1 Introduction 

An alternative to the feature-based description is a representation based on
dissimilarity relations between objects. In general, dissimilarities are built directly
on raw or preprocessed measurements, e.g. based on template matching. The use
of dissimilarities is especially of interest when features are difficult to obtain or
when they have little discriminative power. Such situations are encountered in
practice when there is no straightforward manner to define features, when data is
highly dimensional or when features consist of both continuous and categorical
measurements. The choice in favor of dissimilarity representations depends also
on the application or the data itself. For instance, some particular characteristics
of objects or measurements, like curves or shapes, may naturally lead to such
representations, since they make recognition tasks more feasible.

To construct a decision rule on dissimilarities, the training set T of size n and
the representation set R of size r will be used. R consists of prototypes which
are representatives of all classes present. In the learning process, a classifier is
built on the n × r dissimilarity matrix D(T, R), relating all training objects to
all prototypes. The information on a set S of s new objects is provided in terms
of their distances to R, i.e. as an s × r matrix D(S, R).
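
As a concrete illustration, the dissimilarity matrix can be computed as below. This is a minimal sketch assuming vector-valued objects and the Euclidean distance; the function names and the toy sets are hypothetical, and any other dissimilarity measure can be passed in place of `euclidean`.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dissimilarity_matrix(objects, prototypes, dist=euclidean):
    """Return D(objects, prototypes): one row per object, one column per prototype."""
    return [[dist(x, p) for p in prototypes] for x in objects]


T = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]  # toy training set, n = 4
R = [T[0], T[3]]                                       # toy representation set, r = 2
D_TR = dissimilarity_matrix(T, R)                      # n x r matrix
```

The same function applied to new objects S and the same prototypes R yields the s × r matrix D(S, R) on which a trained classifier is evaluated.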

A conventional way to discriminate between objects represented by dissimilarities
is the nearest neighbor rule (NN). This method suffers, however, either
from a potential loss of accuracy when a small set of prototypes is selected or
from its sensitivity to noise. To overcome these limitations, we have proposed
another approach. Our suggestion is to treat the dissimilarity representation
D(T, R) as a description of a space where each dimension corresponds to a distance
to an object. D(x, R) can, therefore, be seen as a mapping of x onto an
r-dimensional dissimilarity space. The advantage of such a representation is that
any traditional decision rule operating on feature spaces may be used.

Most of the commonly-used dissimilarity measures, e.g. the Euclidean distance
or the Hamming distance, are based on sums of differences between measurements.
The choice of Bayesian classifiers, assuming normal distributions,
is a natural consequence of the central limit theorem applied to them, when a
large number of measurements is considered. The LNC (Linear Normal densities
based Classifier) is especially of interest because of its simplicity. Such a
suggestion is strongly supported by our earlier experiments.

Selecting a good dissimilarity measure becomes an issue for the classification
problem at hand. When considering a number of different possibilities, it
may happen that there are no convincing arguments to prefer one measure over
another. Therefore, the interesting question is whether combining dissimilarity
representations might be beneficial. Two possibilities are considered here to study
this problem. In the first one, the base classifiers (the LNC or the NN rule)
are built on each dissimilarity representation separately and then combined
into one decision rule. If the representations differ in character, a more powerful
decision rule may be constructed by combining them. In the second one, instead of
combining classifiers, representations are combined to create a new representation
for which only one classifier has to be trained.

The paper is organized as follows. Section 2 gives some insight into the
dissimilarity representations, classifiers and combining rules used. Section 3 describes
the dataset and the experiments conducted. Results are discussed in Section 4
and conclusions are summarized in Section 5.

2 Combining Dissimilarity Representations 

Assume that we are given the representation set R and p different dissimilarity
representations $D^{(1)}(T,R), D^{(2)}(T,R), \ldots, D^{(p)}(T,R)$. Our idea is to
combine good base classifiers, but on distinct representations. It is important to
emphasize that the distance representations should have different character, otherwise
they convey similar information and not much can be gained by their combination.

Two cases are considered here. In the first one, a single LNC is trained on
each representation $D^{(i)}(T,R)$ separately and then all of them are combined in
the end. In the second case, the NN rule is also included. The NN rule and
the LNC differ in their decision-making process and their assignments. The NN
method operates on dissimilarity information in a rank-based way, while the
LNC approaches it in a feature-based way. Although the recognition accuracy of
the NN method is often worse than that of the LNC, still better results may be
obtained when both types of classifiers are included in the combining procedure.
Although many possibilities exist for combining classifiers [5], we limit ourselves
to fixed rules operating on posterior probabilities. For the LNC, the posterior
probabilities are based on normal density estimates, while for the NN method,
they are estimated from distances to the nearest neighbor of each class [5].
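
The fixed rules considered here (mean, product and maximum) act on the per-class posterior estimates of the base classifiers. The following sketch shows one possible implementation; the function names and the toy posteriors are hypothetical, and the final normalization is added only for readability of the scores.

```python
import math


def combine(posteriors, rule="mean"):
    """Combine posteriors (one list of class probabilities per base classifier)."""
    n_classes = len(posteriors[0])
    per_class = [[p[c] for p in posteriors] for c in range(n_classes)]
    if rule == "mean":
        scores = [sum(v) / len(v) for v in per_class]
    elif rule == "product":
        scores = [math.prod(v) for v in per_class]
    elif rule == "max":
        scores = [max(v) for v in per_class]
    else:
        raise ValueError(f"unknown rule: {rule}")
    total = sum(scores)  # assumed non-zero; all-zero products are not handled here
    return [s / total for s in scores]


def decide(posteriors, rule="mean"):
    """Index of the class with the highest combined score."""
    scores = combine(posteriors, rule)
    return max(range(len(scores)), key=scores.__getitem__)


# Three hypothetical base classifiers, two classes.
toy = [[0.9, 0.1], [0.9, 0.1], [0.05, 0.95]]
decisions = {rule: decide(toy, rule) for rule in ("mean", "product", "max")}
```

In this toy case the mean and product rules follow the two agreeing classifiers, while the maximum rule follows the single most confident one, which illustrates that the rules can disagree on the same inputs.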

Another approach to learning from many distinct dissimilarity representations
is to combine them into a new one and then train, e.g., a single LNC. As
a result, a more powerful representation may be obtained, allowing for a better
discrimination. The first method for creating a new representation relies on
building an extended representation $D_{ext}$, in matrix notation given by:



$$D_{ext}(T,R) = \left[\, D^{(1)}(T,R) \;\; D^{(2)}(T,R) \;\; \ldots \;\; D^{(p)}(T,R) \,\right] \qquad (1)$$



It means that a single object is now characterized by pr dissimilarities coming
from p various representations, but still computed to the same prototypes. The
requirement of having the same prototypes is not crucial; however, for the sake
of simplicity, the same representation sets are used here.

In the second method, all distances of different representations are first scaled
so that they all take values in a similar range. Then, the final representation is
created by computing their sum, as shown below:

$$D_{sum}(T,R) = \sum_{i=1}^{p} D^{(i)}_{max}(T,R), \qquad (2)$$

where $D^{(i)}_{max}(T,R) = \alpha_i D^{(i)}(T,R)$ and the $\alpha_i$'s scale all
representations so that their maximum values become equal. (Note that now the
representation sets should be identical to perform the sum operation.) The scaling
procedure is necessary, otherwise the new representation will copy the character of
the representation contributing the most to the sum, i.e. the one with the largest
distances. Scaling changes the orders of magnitude, but not the rankings, therefore
all neighbor information is preserved. More sophisticated possibilities of combining
can be considered as well, e.g. the weighted sum or the median of a sequence of
dissimilarity values of different representations relating a training object to the
same prototype.
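
Both constructions are simple matrix operations. The sketch below illustrates them on two toy 2 × 2 representations; the function names are hypothetical, and the scaling constants are taken as 1/max of each matrix, which is one choice that makes the maxima equal.

```python
def extended_representation(reps):
    """Concatenate p matrices of size n x r row by row into one n x (p*r) matrix."""
    n = len(reps[0])
    return [[d for rep in reps for d in rep[i]] for i in range(n)]


def summed_representation(reps):
    """Scale every matrix so that its maximum is 1, then add them element-wise."""
    scaled = []
    for rep in reps:
        largest = max(max(row) for row in rep)  # assumed to be positive
        scaled.append([[d / largest for d in row] for row in rep])
    n, r = len(reps[0]), len(reps[0][0])
    return [[sum(s[i][j] for s in scaled) for j in range(r)] for i in range(n)]


# Two toy representations on very different scales.
D1 = [[0.0, 2.0], [4.0, 1.0]]
D2 = [[0.0, 50.0], [100.0, 25.0]]
D_ext = extended_representation([D1, D2])
D_sum = summed_representation([D1, D2])
```

Without the scaling step the sum would be dominated by `D2`, whose distances are 25 times larger; after scaling both matrices contribute equally, while `D_ext` keeps them side by side at the cost of a p times higher dimensionality.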



3 Dataset and Experiments 

[Figure 1: histograms in two rows of three panels each, titled Blurred-Mod. Hausd.,
Blurred-Hamming and Mod. Hausd.-Hamming; the horizontal axis runs from 0 to 0.4 in
the top row and from 0 to 0.8 in the bottom row.]

Fig. 1. Spearman coefficients (top) and traditional correlation coefficients (bottom)
comparing dissimilarity representations.

To illustrate our point, we investigate a 2-class classification problem between
the NIST handwritten digits 3 and 8. The digits are represented as 128 × 128
binary images. Since no natural features arise from the application, constructing
dissimilarities is an interesting possibility to deal with such a recognition problem.
Three dissimilarity measures are considered: Hamming, modified-Hausdorff
and ‘blurred’, resulting in the representations $D_H$, $D_{MH}$ and $D_B$,
respectively. The Hamming distance counts the number of pixels which disagree.
The modified-Hausdorff distance is found useful for template matching
purposes. It measures the difference between two sets (here two contours)
$A = \{a_1, \ldots, a_g\}$ and $B = \{b_1, \ldots, b_h\}$ and is defined as
$D_{MH}(A,B) = \max(h_M(A,B), h_M(B,A))$, where
$h_M(A,B) = \frac{1}{g} \sum_{a \in A} \min_{b \in B} \|a - b\|$. To find
$D_B$, images are first blurred with a Gaussian kernel with a standard deviation
of 8 pixels. Then the Euclidean distance is computed between the blurred
versions. The resulting distances are referred to as the ‘blurred’ distances.
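
For reference, the modified Hausdorff distance between two point sets can be sketched as follows, taking the directed distance as the mean, over the points of the first set, of the distance to the closest point of the second set. The function names and the toy contours are hypothetical; the brute-force search is quadratic and meant for illustration only.

```python
import math


def directed_mh(A, B):
    """h_M(A, B): mean over a in A of the distance to the closest point b in B."""
    return sum(min(math.dist(a, b) for b in B) for a in A) / len(A)


def modified_hausdorff(A, B):
    """Symmetric modified Hausdorff distance: the larger of the two directed values."""
    return max(directed_mh(A, B), directed_mh(B, A))


A = [(0.0, 0.0), (1.0, 0.0)]
B = [(0.0, 0.0), (4.0, 0.0), (1.0, 0.0)]
d_AB = modified_hausdorff(A, B)
```

Here every point of `A` coincides with a point of `B`, so the directed distance from `A` to `B` is 0, while the extra point (4, 0) of `B` lies 3 away from `A` and gives a directed distance of 3/3 = 1 in the other direction.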

Each of the distance measures uses the image information in a particular way:
binary information, contours or blurring. From the process of the construction,
it follows that our dissimilarity representations differ in properties. To
demonstrate their different characteristics, the Spearman rank correlation coefficient
is used to compare the rankings of the distances computed to each prototype.
Basically, we want to show that the rankings differ between representations.
Therefore, for each pair of representations, the Spearman coefficients between the
distance rankings to all prototypes are computed. Histograms of their distributions
are presented in Fig. 1. All coefficients are between −0.05 and 0.4, and most of
them are smaller than 0.2, which implies that the rankings differ significantly.
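
The per-prototype comparison can be sketched as below: for every prototype (column), the distances of all objects to it are ranked in both representations and the two rankings are correlated. The function names and toy matrices are hypothetical, and the rank-difference formula used here assumes that there are no tied distances.

```python
def ranks(values):
    """Rank positions 1..n; ties are broken by order of appearance (no averaging)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman(x, y):
    """Spearman coefficient via the rank-difference formula (valid without ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def per_prototype_spearman(D1, D2):
    """One coefficient per prototype (column) of two n x r representations."""
    r = len(D1[0])
    return [spearman([row[j] for row in D1], [row[j] for row in D2])
            for j in range(r)]


# Toy representations: 4 objects, 2 prototypes.
D_a = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
D_b = [[10.0, 4.0], [20.0, 3.0], [30.0, 2.0], [40.0, 1.0]]
coefficients = per_prototype_spearman(D_a, D_b)
```

A histogram of such coefficients over all prototypes gives the kind of summary shown in the top row of Fig. 1; values near zero mean that the two measures order the objects almost independently.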

The traditional correlation coefficient is used to check whether the dissimilarity
spaces of the individual representations (and, therefore, linear classifiers built
there) are different. Such correlation values are higher than those given by the
Spearman coefficients. This is to be expected, since now the exact distances are
considered, which cannot completely vary from one representation to another, since the
representations are descriptions of the same data and the same relations. On average,
the correlations are found to be (see Fig. 1) 0.39 between the blurred and
modified Hausdorff, 0.56 between the blurred and Hamming and 0.28 between
the modified Hausdorff and Hamming. In the end, most coefficients are smaller
than 0.6 and thereby indicate only weak linear dependencies. Consequently,
we can say that our dissimilarity representations differ in character.

The experiments are performed 25 times and the results are averaged. In a
single experiment, the data, consisting of 1000 objects per class, is randomly
split into two equally-sized sets: the design set L and the test set S. Both L
and S contain 500 examples per class. The test set is kept constant, while L
serves for obtaining the training sets $T_1$, $T_2$, $T_3$ and $T_4$ (being subsets
of L) of the following sizes: 50, 100, 300 and 500 (= L). For each training set, the
experiments are conducted with varying size of the representation set R. Here,
for simplicity, R is chosen to be a random subset of the training set.

[Figure 2: two panels, both titled "Digits 3 and 8; TR: 500 per class".]

Fig. 2. Averaged classification error of the individual LNC’s (left) and NN rules (right)
as a function of the representation set size for the training set $T_4$.

[Figure 3: two panels titled "Digits 3 and 8; TR: 300 per class" and
"Digits 3 and 8; TR: 500 per class".]

Fig. 3. Averaged classification error as a function of the representation set size for the
individual NN rules trained on the sets $T_3$ (left) and $T_4$ (right).

4 Discussion 

Considering single classifiers, it appears that the LNC consistently outperforms
the NN rule for all training sets $T_1$ to $T_4$. Also, the LNC built on the blurred
dissimilarities reaches a higher accuracy than for the other two representations.
Since this behavior is repeated over all training sets, only the performance of
the individual classifiers for the largest training set $T_4$ is presented in Fig. 2.

The results of combining either classifiers or representations for different
training sets are presented in the accompanying figures. Small, moderate and large
training sets are considered to investigate the influence of the training size on our
combining results. All these plots show curves of averaged classification
error (based on 25 runs) together with its standard deviation. Each error curve
is a function of the representation set size, where the largest representation set
considered is about half of the training set. Since our goal is to improve the
performance of single classifiers by combining the information, all the results are
presented with respect to the behavior of the LNC on the blurred representation
$D_B$, the one that reaches the highest individual accuracy overall.

Fig. 4. Averaged classification error as a function of the representation set size for the
individual LNC’s combined by the product, mean or max rule or for both the LNC’s
and the NN methods combined by the mean operations.

Fig. 3 presents the generalization errors obtained for combining three individual
NN methods by the mean, maximum and product rules. Operating on
posterior probabilities is motivated by the intention of combining both the LNC
and the NN method further on. Although the estimation of these probabilities is
rather crude for the NN method, it still allows for an improvement of the combined
rules. In all cases, the combination by the mean, max or product operation
gives significantly better results than each individual NN rule. The larger the
training and representation sets, the more pronounced the gain in accuracy.

Fig. 4 shows the error curves obtained for three individual LNC’s combined
by the mean, maximum and product rules. For all training sets and small
representation sets (in comparison to the training set size) considered, the product
and maximum rules give slightly better results than the mean rule. However, for
larger representation sets, the mean rule performs better. In addition, the error
curve for the mean combiner of both the LNC and NN method is also shown. It
can be observed that incorporating the NN rule into the combiner somewhat lowers
the classification errors for larger representation sets. (This does not hold
for small representation sets due to bad performance of each individual NN rule.)

Fig. 5 presents the error curves of a single LNC operating on new dissimilarity
representations constructed from the three given: $D_B$, $D_{MH}$ and $D_H$. Two
different cases are considered here: the extended representation $D_{ext}$ (1) and the
combined representation $D_{sum}$ (2). The LNC on $D_{sum}$ significantly outperforms
the individual LNC’s (it reaches higher accuracy than the best individual result
on $D_B$), which is observed for all training sets. The LNC on $D_{ext}$ can gain even
better accuracy; however, the comparison between the representations $D_{sum}$ and
$D_{ext}$ should be explained carefully. If the LNC is trained on $D_{sum}$ using, say,
r prototypes per class, then the representation $D_{ext}$ is built from three such
representations, each based on r prototypes, thereby the LNC operates in a
3r-dimensional space. It means that for larger representation sets, the total
number of dimensions exceeds the training size. The LNC is then not defined
since the sample covariance matrix becomes singular and its inverse cannot be
determined. In such cases, a fixed, relatively large regularization is used. For
moderate representation sizes (for which the dimensionality of $D_{ext}$ approaches
the number of training examples) the error curve of the LNC shows a peaking
behavior (characteristic for this classifier). Therefore, worse performance is
observed when the number of prototypes is close to one third of the training size. For
either small or larger representation sets, a very good performance is reached.

[Figure 5: four panels titled "Digits 3 and 8; TR: 50 per class", "TR: 100 per class",
"TR: 300 per class" and "TR: 500 per class".]

Fig. 5. Averaged classification error of the LNC as a function of the representation set
size for the combined representations.

Fig. 6 presents the comparison between the mean combiner of individual
classifiers and the LNC trained on the combined representation $D_{sum}$. For larger
representation sets, the LNC trained on $D_{sum}$ works somewhat better than the
combined decision rule consisting of the LNC’s and NN methods.

[Figure 6: four panels titled "Digits 3 and 8; TR: 50 per class", "TR: 100 per class",
"TR: 300 per class" and "TR: 500 per class".]

Fig. 6. Comparison between the accuracy of the combined classifiers built on each
representation separately and one LNC on the combined representation $D_{sum}$. The
classification error is shown as a function of the representation set size.

Summarizing, most of the combining rules perform significantly better than the
individual classifiers. For small dissimilarity spaces, the representations tend to be
independent and, therefore, the product rule based on the LNC’s is expected to give
better results than the mean rule (here observed only slightly). For larger
dissimilarity spaces, the posterior probabilities are not well estimated, and the
product rule deteriorates; then the mean combiner is preferred. For the NN rule, the
posterior probabilities are estimated from distances to the nearest neighbor and do
not depend on the dimensionality of the problem. Therefore, both combiners perform
about the same.

To illustrate the importance of dissimilarity representations of a different nature,
we present an example where the Hausdorff dissimilarity $D_{HS}$ is used
instead of the Hamming distance. Therefore, the triple $\{D_B, D_{MH}, D_{HS}\}$ is
considered. The Hausdorff distance and the modified Hausdorff distance are similar;
however, the latter violates the triangle inequality. Therefore, in the modified
Hausdorff representation the dissimilarity rankings are changed with respect to
the Hausdorff one. However, the dissimilarity spaces $D_{MH}$ and $D_{HS}$ are rather
similar. In Fig. 8, histograms of both the Spearman and traditional correlation
coefficients for these two representations are plotted. The Spearman values are
similar to those obtained before (compare Fig. 1), but the traditional correlations
become much higher, on average 0.91, indicating high dependence between those
two dissimilarity spaces. It means that although by combining the individual NN
rules for $D_B$, $D_{MH}$ and $D_{HS}$ an essential improvement may be gained, it does
not necessarily hold for combining the LNC’s. Fig. 7 presents the comparison
between the performances of such classifiers combined by the mean rule for the
training set $T_4$. It can be clearly observed that when $D_{HS}$ is used instead of
$D_H$, the performance of the combined LNC’s deteriorates. Still, the combined
NN rules behave only somewhat worse than for the triple $\{D_B, D_{MH}, D_H\}$.

[Figure 7: two panels titled "Digits 3 and 8; TR: 500 per class" and
"Digits 3 and 8; TR: 50 per class".]

Fig. 7. Comparison of the classification error for the combined LNC’s (left) and NN
rules (right) and for two representation triples: $\{D_B, D_{MH}, D_H\}$ and
$\{D_B, D_{MH}, D_{HS}\}$ and the training set $T_4$. Combining is done by applying the
mean rule.

[Figure 8: two histograms, both titled "Hausd.-Mod. Hausd."; the horizontal axis runs
from 0 to 1.]

Fig. 8. Spearman and traditional correlation coefficients comparing $D_{MH}$ and
$D_{HS}$.

When the Hausdorff representation is added to the original three, the performances
of the combined individual classifiers or of the LNC on $D_{sum}$ improve only
slightly or not at all. The only significant improvement is observed for the
extended representation $D_{ext}$.

5 Conclusions 

Combining a number of distance representations may be of interest when there is 
no clear preference for a particular one. It can be beneficial when the dissimilarity 
representations emphasize different data characteristics. This is illustrated by a 
2-class recognition problem between the digits 3 and 8 for three dissimilarity 
representations: Hamming, modified Hausdorff and blurred. 

We have analyzed two possibilities of combining such information, either by 
combining classifiers or by combining representations themselves. In the first 
approach, individual classifiers are found for each representation separately and 
then they are combined into one rule. Our experiments show that the mean 
combining rule works well, especially for larger representation sets (with respect
to the training size). In comparison to the best results of individual classifiers, the
mean combiner based on three LNC’s (built on each representation separately),
or even better, the mean combiner based on three LNC’s and three NN methods,
performs significantly better.

In the second approach, dissimilarity representations are combined into a
new one on which a single LNC is built. They are first scaled so that their maximal
values are equal and then summed up, resulting in the representation Dsum. We
have also investigated other scalings, e.g. making the means identical or making
the maximum values for each prototype equal. They gave worse results and,
therefore, are not reported here. The LNC on Dsum significantly improves the
results of each individual LNC. It appears that the combined representation,
built in this way, has more discriminative power. As a reference, the extended
representation is also considered. The LNC on such a representation
reaches even better results than on Dsum, provided that the number of all
prototypes is either small or large in comparison with the training set size.
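The construction of Dsum described above can be sketched in a few lines. This is only a minimal illustration, assuming each representation is stored as a list of rows (objects by prototypes) and that "equal maximal values" is realised by scaling every matrix to a maximum of 1; the function names and the toy matrices are hypothetical and not taken from the experiments.

```python
def scale_to_unit_max(D):
    """Divide every entry of a dissimilarity matrix by its largest entry."""
    peak = max(max(row) for row in D)
    if peak == 0:
        return [list(row) for row in D]  # all-zero matrix: nothing to scale
    return [[v / peak for v in row] for row in D]


def combine_by_sum(representations):
    """Sum several equally shaped dissimilarity matrices after max-scaling."""
    scaled = [scale_to_unit_max(D) for D in representations]
    n_rows, n_cols = len(scaled[0]), len(scaled[0][0])
    return [[sum(S[i][j] for S in scaled) for j in range(n_cols)]
            for i in range(n_rows)]


# Hypothetical toy data: 2 objects x 2 prototypes in two representations.
D_a = [[0.0, 4.0], [2.0, 0.0]]
D_b = [[0.0, 10.0], [5.0, 0.0]]
D_sum = combine_by_sum([D_a, D_b])
```

After scaling, both toy matrices contribute on the same numerical range, so neither representation dominates the sum merely because of its units.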

In conclusion, when dissimilarity representations differ in character, combining
them can be beneficial, either by combining individual classifiers or by creating
a new representation. In our experiments, we have shown that when distinct
representations are combined into Dsum, a representation which allows for
better discrimination can be obtained. This not only improves the classifier, but
is also of interest because of the computational aspect.

Abstract. In previous work, we have confirmed the performance gains
that can be obtained in speaker recognition by splitting the (clean) wide- 
band speech signal into several subbands, employing separate pattern 
classifiers for each subband, and then using multiple classifier fusion 
(‘recombination’) techniques to produce a final decision. However, our 
earlier work used fairly rudimentary recognition techniques (dynamic 
time warping), just sum or product fusion rules and the spoken word 
seven only. The question then arises: Can subband processing still deliver 
performance gains when using state-of-the-art recognition techniques, 
more sophisticated recombination, and different spoken digits? To answer
this, we have applied hidden Markov modelling and artificial neural
network (ANN) recombination to text-dependent speaker identification, 
for spoken digits seven and nine. We find that ANN recombination per- 
forms about as well as the sum rule operating in log probability space, 
but the ANN results are not unique. They depend critically on user- 
specified parameters, initialisation, etc. On clean speech, all classifiers 
achieve close to 100% identification. Subband techniques offer advan- 
tages when the speech signal is significantly degraded by noise. 



1 Introduction 



Automatic speaker recognition is an important, emerging technology with
many potential applications in commerce and business, security, surveillance
etc. (Campbell 1997). Recent attention in speaker recognition has focussed
on the use of subband processing, whereby the wideband signal
is fed to a bank of bandpass filters to give a set of time-varying outputs,
which are individually processed before using multiple classifier techniques
to produce a combined, overall decision (Besacier and Bonastre 1997, 2000;
Sivakumaran, Ariyaeeinia, and Hewitt 1998; Higgins, Damper, and Harris 1999).
Because the subband signals vary slowly relative to the wideband signal,
the problem of representing them by some data model should be simplified
(Finan, Damper, and Sapeluk 2001). Although our previous work has demonstrated
performance gains from subband processing with clean speech, this
used fairly rudimentary recognition and fusion (or ‘recombination’) techniques
(dynamic time warping and sum or product fusion rules respectively), as well as
a single spoken digit (seven). The question which then arises, and which is addressed
in this paper, is: Can subband processing still deliver performance gains
when using state-of-the-art recognition and fusion techniques and a wider variety
of speech materials? To answer this, we have used hidden Markov models
(HMMs) and neural network recombination with spoken digits seven and
nine.

The subband, or multiple classifier, approach has also become popular
in recent years in speech recognition (Bourlard and Dupont 1996;
Tibrewala and Hermansky 1997; Morris, Hagen, and Bourlard 1999). In this
related area, the main motivation has been to achieve robust recognition in the
face of noise. The key idea is that the recombination process allows the overall 
decision to be made taking into account any noise contaminating one or more of 
the partial bands. Hence, as well as investigating subband speaker recognition 
from clean speech, we also report on work in which narrow-band noise is added 
to test utterances. 

The remainder of this paper is organised as follows. Section 2 gives essential
background on the problem of speaker recognition. Section 3 briefly describes
the speech database used. Section 4 describes the subband processing system,
including details of the feature extraction and data modelling. In Section 5 we
detail the various recombination techniques studied, with results presented in
Section 6. Finally, Section 7 concludes.



2 The Speaker Recognition Problem 



The speaker recognition problem can be divided into verification and identification,
each of which may in turn be text-dependent or text-independent
(Campbell 1997; Furui). In verification, the aim is to determine if a given
utterance was produced by a claimed speaker. This is most directly done by
testing the utterance against the model of the claimed speaker, comparing the
score to a threshold, and deciding on the basis of this comparison whether or not
to accept the claimant. In identification, the aim is to determine which speaker
from among a known group produced an utterance. The test utterance is scored
against all possible speaker models, and that with the best score determines the
speaker identity. Of the two tasks, identification is generally accepted to be the
harder, especially for large speaker populations (Doddington 1985, p. 1660).

In text-independent recognition, there are no limits on the vocabulary em- 
ployed by speakers. This is in contrast to text-dependent recognition, where the 
tested utterance comes from a set of predetermined words or phrases. As text- 
dependent recognition only models the speaker for a limited set of speech sounds 
(‘phones’) in a fixed context, it generally achieves higher performance than text- 
independent recognition, which must model a speaker for a variety of phones 
and contexts.

Since identification is simply a matter of selecting among speakers, typically
using a minimum distance decision rule, performance is easily quantified by a
single measure. There are only two possible outcomes (correct or incorrect),
so that the identification error fully specifies the situation. Things are a little
more complicated with verification, where the system has to accept or reject a
claimed speaker identity in the face of potential impersonation. Hence, there are
four possible outcomes, of which two (false acceptance and false rejection) are
errors. Thus, some decision threshold must be set which effects a balance between
the two types of error. Because of the slightly increased difficulty of quantifying
error in verification, we focus exclusively on identification in this paper. Also,
we restrict attention to text-dependent recognition because of its more obvious
applicability (Doddington 1985, p. 1660).



3 Speech Database 



We use the text-dependent British Telecom Millar database, specifically designed 
and recorded for text-dependent speaker recognition research. It consists of 60 
(46 male and 14 female) native English speakers saying the digits one to nine, 
zero, nought and oh 25 times each. Recordings were made in 5 sessions spaced 
over 3 months, to capture the variation in speakers’ voices over time, which is an
important aspect of speaker recognition (Furui 1974).

The speech was recorded in a quiet environment using a high-quality micro- 
phone and a sampling rate of 20 kHz with 16-bit resolution. The speech data 
used here were downsampled to 8 kHz sampling rate as this reduces simulation 
times and is more typical of the data which might be encountered in a real ap- 
plication. For the work reported here, we consider utterances seven and nine, 
i.e., text-dependent identification. Data from the first two sessions (i.e., 10 rep- 
etitions of each word) were used for training and data from the remaining three 
sessions (15 repetitions) were used for testing. 

In order to achieve good performance, manual editing of the start and end 
points of each utterance was necessary. This was done by author JEH. This was 
a time-consuming task: For a fully automatic system, we would obviously need 
to implement a high performance automatic endpointing algorithm. 

As so far described, the speech data are essentially noise-free. However, a
major motivation behind subband processing has been the prospect of achieving
good recognition performance in the presence of narrowband noise. Such
noise affects the entire wideband model but only a small number of subbands.
Hence, we have also conducted identification tests with added noise. Following
Besacier and Bonastre (2000), Gaussian noise was filtered using a 6th-order
Butterworth filter with centre frequency 987 Hz and bandwidth 365 Hz. It was
added to the test tokens at a signal-to-noise ratio of 10 dB. Figure 1 shows typical
power spectra of the wideband speech signal (Fig. 1(a)) and of the subband
signals for the 4 subband case described in the next section (Fig. 1(b)). It can be
seen that the middle two bands are affected by the noise whereas the low- and
high-frequency bands are relatively clean.

Fig. 1. Typical spectra of a test utterance seven with added narrowband noise:
(a) wideband; (b) subband spectra for N = 4. In (b), the two mid-frequency bands
are contaminated but the low- and high-frequency bands are relatively clean.
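Adding noise at a prescribed signal-to-noise ratio amounts to choosing a gain for the noise from the two average powers. The sketch below illustrates only this scaling step, assuming the noise has already been band-limited (the Butterworth filtering is not reproduced) and that power is measured as mean squared amplitude over the whole token; the function names and toy sequences are hypothetical.

```python
import math


def mean_power(x):
    """Average squared amplitude of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def add_noise_at_snr(signal, noise, snr_db):
    """Return signal + g * noise, with g chosen to give the requested SNR in dB."""
    target_noise_power = mean_power(signal) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [s + gain * n for s, n in zip(signal, noise)]


# Hypothetical toy sequences standing in for an utterance and filtered noise.
signal = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
noisy = add_noise_at_snr(signal, noise, snr_db=10.0)

# Recover the SNR from the result as a consistency check.
added = [y - s for y, s in zip(noisy, signal)]
snr_check = 10.0 * math.log10(mean_power(signal) / mean_power(added))
```

At 10 dB the noise power is one tenth of the signal power, whatever the original level of the noise source.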



4 Subband Processing 



Figure 2 shows a schematic of the subband system used here to model each individual
speaker. We use 2, 4, 6, 8 or 10 bandpass filters (6th-order Butterworth).
Filter centre frequencies are equally spaced on the psychophysically-motivated
mel scale (Stevens and Volkmann 1940), and feature extraction is performed on
each subband. The resulting sequences of feature vectors are passed on to each
subband’s HMM recognition algorithm.
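Equal spacing on the mel scale can be illustrated as follows. This is a sketch under stated assumptions rather than the filterbank design actually used: it adopts the widely used analytic approximation mel = 2595 log10(1 + f/700), takes the band edges as 0 Hz and 4000 Hz (the Nyquist frequency of 8 kHz data), and places the centres at the interior points of an equally divided mel axis. None of these three choices is stated in the text.

```python
import math


def hz_to_mel(f_hz):
    """Common analytic approximation to the mel scale (an assumption here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_spaced_centres(n_bands, f_low=0.0, f_high=4000.0):
    """Centre frequencies (Hz) of n_bands filters equally spaced in mel."""
    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (m_high - m_low) / (n_bands + 1)
    return [mel_to_hz(m_low + (k + 1) * step) for k in range(n_bands)]


centres = mel_spaced_centres(4)
```

Because the mel scale is compressive, centres that are equidistant in mel become progressively further apart in hertz, so the low-frequency region receives more filters.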







Fig. 2. Schematic diagram of the subband processing subsystem. Each subband (filter) 
has its own HMM recogniser. In this work, we use either 2, 4, 6, 8 or 10 6th-order 
Butterworth filters with centre frequencies equally spaced on the mel scale. There is 
one such subsystem for each speaker.

Here, we model the speech signal by extracting features on a frame-by-frame
basis. The features are cepstral coefficients, obtained from linear prediction
(Markel and Gray 1976). Cepstral analysis is motivated by, and designed for,
problems centred on voiced speech (Deller, Proakis, and Hansen 1993). However,
it also works well for unvoiced sounds. Cepstral coefficients have been used extensively
in speaker recognition (Furui 1981; Reynolds and Rose 1995), partly because
a simple recursive relation exists that approximately transforms the easily-obtained
linear prediction coefficients into ‘pseudo’ cepstral ones (Atal 1974).
The analysis frame was 16 ms long, Hamming windowed and overlapping by
50%. The first 12 coefficients were used, excluding the zeroth cepstral coefficient
(as is usual).
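The recursive relation between predictor and cepstral coefficients can be sketched as below. The sketch assumes the all-pole model is written H(z) = 1 / (1 - sum_k a_k z^-k); with the opposite sign convention for the a_k the signs change accordingly. The function name is hypothetical, and this is the textbook recursion rather than code from the system described.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Cepstral coefficients c_1..c_n_ceps of H(z) = 1 / (1 - sum_k a_k z^-k).

    `a` holds the predictor coefficients a_1..a_p, so a[0] is a_1.
    The zeroth cepstral coefficient is not computed.
    """
    p = len(a)
    c = [0.0] * (n_ceps + 1)              # c[0] is an unused placeholder
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        # Sum over k of (k / n) * c_k * a_(n-k), restricted to 1 <= n-k <= p.
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]


# A single-pole model has the closed form c_n = a^n / n, which gives a check.
single_pole = lpc_to_cepstrum([0.5], 4)
```

For the 12 coefficients used here the call would be `lpc_to_cepstrum(a, 12)`, with `a` obtained from the linear prediction analysis of one 16 ms frame.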

Subsequently, we have to derive recognition models for the utterances of 
the different speakers, for which we use the popular hidden Markov models. 
HMMs are powerful statistical models of sequential data that have been used
extensively for many speech applications (Rabiner 1989). They embody an underlying
(hidden) stochastic process that can only be observed through a set of
stochastic processes that produce an observation sequence. In speech processing
applications, this observation sequence is the series of feature vectors that have 
been extracted from an utterance. 

Discrete HMMs were used with 4 states for word seven and 3 states for
word nine, plus a start and end state in each case. This structure was found
to give best results in preliminary tests. Apart from self-loops (staying in the
same state), only left-to-right transitions are allowed. Speech frames were vector
quantised, and each HMM has its own linear codebook of size 32. Therefore,
in the wideband case there are 60 codebooks (equal to the number
of speakers) and in the subband system there are 60 × N codebooks (where
N is the number of subbands), which were constructed using a Euclidean
distance metric. HMMs were trained and tested using the HTK software of
Young, Kershaw, Odell, Ollason, Valtchev, and Woodland (2000).
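Vector quantisation with a Euclidean metric reduces each feature vector to the index of its nearest codebook entry, which is the discrete symbol a discrete HMM consumes. The following is a minimal sketch of that lookup only; it does not reproduce HTK, the codebook training is not shown, and the small two-dimensional codebook is hypothetical (the system described uses 32 entries on 12-dimensional cepstra).

```python
def nearest_codeword(frame, codebook):
    """Index of the codebook entry closest to `frame` in squared Euclidean distance."""
    best_index, best_dist = 0, float("inf")
    for index, codeword in enumerate(codebook):
        dist = sum((f - c) ** 2 for f, c in zip(frame, codeword))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantise(frames, codebook):
    """Map a sequence of feature vectors to a sequence of discrete symbols."""
    return [nearest_codeword(frame, codebook) for frame in frames]


# Hypothetical 2-D codebook with 3 entries and three frames to quantise.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
symbols = quantise([[0.1, -0.2], [0.9, 1.2], [3.5, 0.4]], codebook)
```

With one codebook per model, the same frame sequence is quantised separately for every speaker and every subband before scoring.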



5 Subband Recombination 



Our earlier work has used the sum and product rules specified by
Kittler, Hatef, Duin, and Matas (1998). Here, we explore the use of artificial
neural networks (ANNs) for subband recombination. In this work, the HMMs
deliver log likelihood values, so that sum-rule fusion corresponds to taking products
of likelihoods. Under assumptions of conditional independence and equal
priors, this strategy is optimal (e.g., Morris, Hagen, and Bourlard 1999). Using
this rule, the identified speaker, i, is that for whom:

$$ i = \arg\max_{s} \left[ y^{s} \right] = \arg\max_{s} \sum_{n=1}^{N} \log L_n^{s}, \qquad 1 \le s \le S = 60 \qquad (1) $$

where N is the number of classifiers (subbands), $L_n^{s}$ is the likelihood that classifier
n and model speaker s produced the observed data sequence, and $y^{s}$ is the
recombined (final) score for speaker s from the set of S speakers.

The formulation in equation (1) is linear with constant (unity) weights.
However, according to Bourlard and Dupont (1996) in their work on subband
speech recognition: “... it is often argued that the recombination mechanism
should be nonlinear” (p. 427). Also, the assumption of conditional independence
is unsatisfactory. Accordingly, Bourlard and Dupont used a multilayer perceptron
(MLP) trained to estimate posterior probabilities of speech units (HMM
states, phones, syllables or words) given the log-likelihoods of all subbands and all
speech units. It is also intuitively attractive for a recombination scheme to have
variable weights (Okawa, Nakajima, and Shirai 1999), and the MLP offers this.
Hence, MLPs have been used for the ANN recombination in this work (see
Bishop 1995 for relevant background). Various MLP recombination structures
were considered initially, namely:



Single ‘global’ MLP: this structure would have taken each of N subbands 
from each of 60 speaker models to give 60 x N inputs and would have needed 
an output layer capable of encoding S = 60 speaker labels. We did not be- 
lieve there were sufficient data available to train such a large ANN, so this 
option was not pursued. 

Single ‘local’ MLP: this structure has N inputs and only a single output. It 
is trained on outputs from all 60 speaker subsystems (as in Fig. 2). During
test, output from each speaker subsystem is passed in turn to the MLP, and 
the identified speaker is that producing the largest output activation. 

Multiple ‘local’ MLPs: here, there are again N inputs and a single output, 
but there are 60 separate ANNs — one per speaker. The decision rule is as 
for the single local MLP. 



Because it performed best in preliminary tests, only the single local structure
was used here. This is most likely a consequence of the much larger data set used
in training the single local MLP. In particular, the multiple MLPs each ‘saw’
only 10 positive examples as compared to 590 negative examples.

Each MLP was trained 10 times from different initial points in the search 
space, with the initial weights drawn from a zero-mean, unit-variance isotropic 
Gaussian distribution. The single output had a logistic activation function. For 
the results reported here, all MLPs had a single hidden layer of five tanh nodes. It 
was found that using either 10 or 15 hidden nodes did not significantly affect the 
results. Training minimised the cross-entropy error function using a conjugate- 
gradient algorithm. Outputs were trained to 0 or 1, with the latter indicating that 
the MLP classified the utterance as belonging to the speaker model. A weight 
decay scheme (with α = 0.2) was used to prevent over-training. The order of
the training data was randomised to avoid bias in the learning (in terms of all 
the positive examples being presented in a single block) . Training used noise-free 
speech data only. 
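The network structure and decision rule described above can be sketched as a forward pass. This is an illustration under assumptions, not the training code: conjugate-gradient optimisation and weight decay are omitted, the biases are assumed to be initialised from the same Gaussian as the weights, and all function names are hypothetical.

```python
import math
import random


def init_mlp(n_inputs, n_hidden=5, seed=0):
    """Draw all weights and biases from a zero-mean, unit-variance Gaussian."""
    rng = random.Random(seed)
    return {
        "w1": [[rng.gauss(0.0, 1.0) for _ in range(n_inputs)] for _ in range(n_hidden)],
        "b1": [rng.gauss(0.0, 1.0) for _ in range(n_hidden)],
        "w2": [rng.gauss(0.0, 1.0) for _ in range(n_hidden)],
        "b2": rng.gauss(0.0, 1.0),
    }


def mlp_output(params, x):
    """Forward pass: tanh hidden layer followed by a single logistic output."""
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(params["w1"], params["b1"])]
    activation = sum(w * h for w, h in zip(params["w2"], hidden)) + params["b2"]
    return 1.0 / (1.0 + math.exp(-activation))


def cross_entropy(y, target):
    """Cross-entropy error for one example with a 0/1 target."""
    return -(target * math.log(y) + (1 - target) * math.log(1.0 - y))


def identify_speaker(params, subsystem_outputs):
    """Pick the speaker subsystem whose N scaled scores give the largest output."""
    outputs = [mlp_output(params, x) for x in subsystem_outputs]
    return max(range(len(outputs)), key=outputs.__getitem__)
```

Calling `init_mlp` with ten different seeds corresponds to the ten restarts from different initial points; the spread of results across those restarts is what the error bars in the results represent.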

Input data were scaled to be in the unit-interval in each input axis. This 
was to make the weight initialisation easier (as above) and also to avoid slow 
convergence of the weights in the presence of highly imbalanced data (less than 
2% of the examples were positive). Without this scaling it was found that the 
weights could not converge in the number of iterations allowed.

6 Results

All systems tested gave approximately 100% correct identification on the clean 
speech data. Figure 3 shows the results obtained for the noisy speech with the
wideband system, with the sum of log likelihood fusion rule, and with the MLP. 
The latter two are depicted as a function of the number of subbands. Error bars 
are shown for the MLP, as a measure of the variability of the results starting 
from different random initial weight settings. 




Fig. 3. Results as a function of the number of subbands for test utterances seven (a) 
and nine (b). 



There are three notable aspects to these results: 

1. Subband processing delivers enormous performance advantages, raising iden- 
tification from just below 60% correct for the wideband system with word 
seven to approximately 95% correct, and from below 20% correct for the 
wideband system with word nine to 100% correct. 

2. There is little difference in the performance of the two fusion techniques, 
suggesting that the conditional independence assumption in equation (1) is
reasonable. 

3. There is widespread variation in the pattern of results for the two different 
words. For seven, the wideband result is much higher than for nine, yet the 
subband system achieves 100% correct identification for nine but not for 
seven. Best results for seven are obtained with 10 subbands for both fusion 
methods whereas best results for nine are obtained with 4-10 subbands using
equation (1).

7 Conclusions and Future Work 

In this paper, we have extended our earlier work on subband speaker recognition
using multiple classifier techniques. We have studied speaker identification
both in noise-free conditions and with narrowband noise added to the test utterances
only (training remained noise-free). By including word nine in addition
to word seven for 60 speakers, we have tested a wider range of speech materials
than previously.

For the clean speech, all systems tested achieved either 100% correct iden- 
tification or very close to 100%. With noisy speech, the sum rule of recombi- 
nation working on log likelihoods gave comparable results to the MLP fusion. 
The best MLP in preliminary tests was the single local version, which had the 
same number of inputs as the number of subbands and was trained on the entire 
training set. 

Our priorities for future work are to include other fusion techniques in our 
performance comparisons, to explore other kinds of noise contamination, to at- 
tempt to understand the difference in the pattern of results for the two dif- 
ferent words studied here, and beyond that to study all ten spoken digits in 
the database.

Abstract. In this paper we discuss classifier architectures to categorize
time series. Three different architectures for the fusion of local classifier 
decisions are presented and applied to classify recordings of cricket songs. 
Different features from local time windows are extracted automatically 
from the waveform of the sound patterns. These features are used to 
classify the whole time series. We present results for all three classifier 
architectures on a data set of 28 different categories. 



1 Introduction 

The classification of time series is the topic of this paper. In real world applications
information or features extracted from time series are used for the categorization.
For example, in medical diagnosis a patient may be classified into one
of two or more classes using an electrocardiogram (ECG) recording. Another
pattern recognition problem may be the identification of an individual based on
a recording of their speech. One difficulty with the classification of time series is that
the number of measurements is typically large in comparison to the number
of objects and varies from recording to recording. To overcome these problems
some kind of preprocessing and feature extraction on these time series has to be
performed.

In principle there are two approaches to feature extraction for time series:

1. Global features. These features are based on the information in the whole time
series, e.g. the mean frequency, mean energy, etc.

2. Local features. These are derived from subsets of the whole time series,
which are usually defined through a local time window $W^t$. In this type of
feature extraction a set of features is calculated within the window $W^t$. The
window is then moved by a time step $\Delta t$ and the next set of
features is calculated. Moving the window over the whole time series leads
to a sequence of feature vectors.
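Local feature extraction with a moving window can be sketched as follows. The two features computed here (mean energy and the number of zero crossings) are hypothetical stand-ins chosen for brevity, and the window width and step are measured in samples; neither choice comes from the application described later.

```python
def sliding_windows(series, width, step):
    """Yield consecutive windows of `width` samples, advanced by `step` samples."""
    for start in range(0, len(series) - width + 1, step):
        yield series[start:start + width]


def local_features(window):
    """Two hypothetical local features: mean energy and zero-crossing count."""
    energy = sum(v * v for v in window) / len(window)
    crossings = sum(1 for a, b in zip(window, window[1:]) if a * b < 0)
    return (energy, crossings)


# Toy series: an oscillating part followed by a constant part.
series = [1.0, -1.0, 1.0, -1.0, 2.0, 2.0, 2.0, 2.0]
features = [local_features(w) for w in sliding_windows(series, width=4, step=2)]
```

The length of `features` depends on the length of the recording, which is why the number of windows T varies from one time series to the next.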

In this paper we focus on the classification of time series based on local features,
i.e. on sequences of locally derived feature vectors. The paper is organized
as follows: In Section 2 we present three different classifier architectures for the
fusion of local decisions. For the classification of feature vectors we use the fuzzy
k-nearest-neighbour approach, which is described in Section 3. In Section 4 we
present an application of these classifiers in the domain of bioacoustics (the classification
of cricket songs). We present and discuss the classification results for
the proposed fusion architectures on this dataset in Section 5.

2 Classification Fusion for Local Features 

In this section we propose three different types of classifier architectures for the
fusion of local features and decisions based on these local features. This situation
is illustrated in Figure 1. Here a window $W^t$ covering a small part of the time
series is moved over the whole time series. For each window $W^t$, $t = 1, \ldots, T$, a
set of p features $F_i^t \in \mathbb{R}^{d_i}$, $i = 1, \ldots, p$, with $d_i \in \mathbb{N}$, is extracted from the time
series. Typically T, the number of time windows, varies from time series to time
series.



Fig. 1. A set of p features $F_1^t, \ldots, F_p^t$ is extracted from a local time window $W^t$ located
at time t.



A. Architecture CDT (see Figure 2a)

In this architecture the classification of the time series is performed in the
following three steps:

1.) Classification of single feature vectors (C-step)

For each feature $F_i^t$, $i = 1, \ldots, p$, derived from the time window $W^t$, a
classifier is given through a mapping (see Section 3)

$$ c_i : \mathbb{R}^{d_i} \to \Delta \qquad (1) $$

where the set $\Delta$ is defined through

$$ \Delta := \Big\{ (q_1, \ldots, q_l) \in [0,1]^l \;\Big|\; \sum_{i=1}^{l} q_i = 1 \Big\} \qquad (2) $$

Here l is the number of classes. Thus for time window $W^t$, the p classification
results are $c_1(F_1^t), \ldots, c_p(F_p^t)$.

2.) Decision fusion of the local decisions (D-step)

The classification results are combined into a decision $C^t \in \Delta$ for the
time window $W^t$ through a fusion mapping $\mathcal{F} : \Delta^p \to \Delta$

$$ C^t := \mathcal{F}\big(c_1(F_1^t), \ldots, c_p(F_p^t)\big), \qquad t = 1, \ldots, T \qquad (3) $$

which calculates the fused classification result of the p different decisions.

3.) Temporal fusion of decisions over the whole time series (T-step)

The classification for the whole set of time windows $W^t$, $t = 1, \ldots, T$, is
given through

$$ C^o := \mathcal{F}(C^1, \ldots, C^T) \qquad (4) $$

where again $\mathcal{F} : \Delta^T \to \Delta$ is a fusion mapping.

B. Architecture DCT (see Figure 2b)

Here the classification of the whole time series is determined through the
classification of the fused features and decision fusion over the whole time
series:

1.) Data fusion of feature vectors (D-step)

Here the extracted features $F_1^t, \ldots, F_p^t$ in the time window $W^t$ are simply
concatenated into a single feature vector $F^t = (F_1^t, \ldots, F_p^t) \in \mathbb{R}^P$, with
$P = \sum_{i=1}^{p} d_i$.

2.) Classification (C-step)

The combined feature vector $F^t$ is classified into $C^t \in \Delta$ using a classifier
mapping $c : \mathbb{R}^P \to \Delta$.

$$ C^t := c(F^t) \qquad (5) $$

3.) Temporal fusion of decisions over the whole time series (T-step)

Here

$$ C^o = \mathcal{F}(C^1, \ldots, C^T) \qquad (6) $$

is the classification result for the whole set of time windows $W^t$, $t = 1, \ldots, T$.

Fig. 2. Classifier architectures for the classification of time series: (a) CDT
(classification, decision fusion, temporal fusion); (b) DCT (data fusion, classification,
temporal fusion); (c) CTD (classification, temporal fusion, decision fusion).

C. Dietrich, F. Schwenker, and G. Palm

C. Architecture CTD (see Figure 2c)

Here, the final classification result is determined through temporal fusion
followed by decision fusion.

1.) Each of the p feature vectors $F_1^t, \ldots, F_p^t$ within $W^t$ is classified (C-step)

$$ F_j^t \mapsto c_j(F_j^t) \in \Delta \qquad (7) $$

2.) Temporal fusion based on each feature $j = 1, \ldots, p$ (T-step)

$$ C_j = \mathcal{F}\big(c_j(F_j^1), \ldots, c_j(F_j^T)\big) \in \Delta \qquad (8) $$

where again $\mathcal{F} : \Delta^T \to \Delta$ is a fusion mapping.

3.) Decision fusion over all p features (D-step)

$$ C^o = \mathcal{F}(C_1, \ldots, C_p) \qquad (9) $$

where $\mathcal{F} : \Delta^p \to \Delta$ is a fusion mapping.
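The three architectures differ only in the order in which classification and the two fusion steps are applied, which the following sketch makes explicit. It assumes component-wise averaging as the fusion mapping and represents the data as `features[t][i]`, the i-th feature vector of window t; the function names and this data layout are hypothetical.

```python
def average_fusion(decisions):
    """Component-wise mean of a list of soft-label vectors."""
    n = len(decisions)
    return [sum(col) / n for col in zip(*decisions)]


def classify_cdt(features, classifiers, fuse=average_fusion):
    """CDT: classify each feature, fuse over features, then fuse over time."""
    per_window = [fuse([c(f) for c, f in zip(classifiers, window)])
                  for window in features]
    return fuse(per_window)


def classify_dct(features, joint_classifier, fuse=average_fusion):
    """DCT: concatenate the features of each window, classify, fuse over time."""
    per_window = [joint_classifier([v for f in window for v in f])
                  for window in features]
    return fuse(per_window)


def classify_ctd(features, classifiers, fuse=average_fusion):
    """CTD: classify each feature, fuse over time per feature, then over features."""
    per_feature = [fuse([c(window[i]) for window in features])
                   for i, c in enumerate(classifiers)]
    return fuse(per_feature)
```

With plain averaging, CDT and CTD return the same result whenever every window supplies all p features, since a mean of means over a full grid does not depend on the order of averaging. The two architectures can differ once a nonlinear fusion mapping is used.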

For the integration different fusion mappings may be considered. We now
consider the decision profile $DP(x^1, \ldots, x^n)$ for an input $x^1, \ldots, x^n$, which contains the individual
classifier outputs. Here n denotes the number of feature vectors, which is equal
to the number of classifier outputs, and l the number of categories.

$$ DP(x^1, \ldots, x^n) = \begin{bmatrix} e_1^1(x^1) & \cdots & e_j^1(x^1) & \cdots & e_l^1(x^1) \\ \vdots & & \vdots & & \vdots \\ e_1^i(x^i) & \cdots & e_j^i(x^i) & \cdots & e_l^i(x^i) \\ \vdots & & \vdots & & \vdots \\ e_1^n(x^n) & \cdots & e_j^n(x^n) & \cdots & e_l^n(x^n) \end{bmatrix} \qquad (10) $$

For temporal fusion, $e^i \in \Delta$, $i = 1, \ldots, T$ (see Eqs. 4, 6 and 8), denotes the classifier
output of the i-th time window, whereas for decision fusion $e^i \in \Delta$, $i = 1, \ldots, p$
(see Eqs. 3 and 9), denotes the output of the i-th classifier. Then $e_j^i$ is the evidence for
class j obtained from the i-th classifier/time window. Temporal fusion and
decision fusion are done by average fusion or symmetrical probabilistic fusion.
For n decision vectors $e^1, \ldots, e^n \in \Delta$ the average of the classification results is
given by

$$ \mathcal{A}(e^1, \ldots, e^n) := \frac{1}{n} \sum_{i=1}^{n} e^i \qquad (11) $$

A probabilistic approach to combine classification results is to apply Bayes’
rule under the assumption that the classification results are independent.
For this the posterior probability $P(\omega_j | x^1, \ldots, x^n)$ for the class $\omega_j$ has to be
approximated, given the evidence readings (classifications) $e_j^1, \ldots, e_j^n$ for the class
$\omega_j$, which will be interpreted as posterior probabilities $P(\omega_j | x^1), \ldots, P(\omega_j | x^n)$.
Then the classifier output is given by

$$ \mathcal{P}(e^1, \ldots, e^n)_j := P(\omega_j | x^1, \ldots, x^n) + \epsilon_j(x^1, \ldots, x^n) \qquad (12) $$

where $\epsilon_j(x^1, \ldots, x^n)$ is the error made by the classifier ensemble.

Let $O(\omega_j | x^1, \ldots, x^n)$ be the posterior odds. Then the posterior probability is
given by

$$ P(\omega_j | x^1, \ldots, x^n) = \frac{O(\omega_j | x^1, \ldots, x^n)}{1 + O(\omega_j | x^1, \ldots, x^n)} = 1 - \left( 1 + \frac{P(\omega_j | x^1, \ldots, x^n)}{P(\neg\omega_j | x^1, \ldots, x^n)} \right)^{-1} \qquad (13) $$

Assuming that the conditional probabilities $P(x^i | \omega_j)$ for $i \ne i'$ are independent
of $P(x^{i'} | \omega_j)$ leads to the product rule

$$ P(\omega_j | x^1, \ldots, x^n) = \alpha \, P(\omega_j) \prod_{i=1}^{n} P(x^i | \omega_j) \qquad (14) $$

where $\alpha$ is a normalizing constant to be computed by requiring that Eq. 14
sums to unity (over j) [8]. For a multi-class problem the posterior odds are then
given by

$$ O(\omega_j | x^1, \ldots, x^n) = \frac{P(\omega_j | x^1, \ldots, x^n)}{P(\neg\omega_j | x^1, \ldots, x^n)} = \frac{P(\omega_j)}{P(\neg\omega_j)} \prod_{i=1}^{n} \frac{P(x^i | \omega_j)}{P(x^i | \neg\omega_j)} \qquad (15) $$

Integrating the posterior odds (Eq. 15) into Eq. 13 and applying Bayes’ rule
leads to the symmetrical probabilistic fusion function

$$ \mathcal{P}(e^1, \ldots, e^n)_j = 1 - \left( 1 + \frac{P(\omega_j)}{P(\neg\omega_j)} \prod_{i=1}^{n} \frac{P(\omega_j | x^i) \, P(\neg\omega_j)}{P(\neg\omega_j | x^i) \, P(\omega_j)} \right)^{-1} \qquad (16) $$

where $P(\omega_j)$ denotes the class probability for the class $\omega_j$, which is set to $1/l$ in
our numerical experiments.
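The symmetrical probabilistic fusion function can be sketched directly from the posterior odds. The sketch below is an illustration under assumptions: the evidence values are used in place of the single-input posteriors, the prior defaults to a uniform 1/l, and a small clipping constant `eps` (not part of the formulation above) keeps the odds finite when an evidence value is exactly 0 or 1. The function name is hypothetical.

```python
def probabilistic_fusion(decisions, prior=None, eps=1e-9):
    """Fuse n soft-label vectors through the posterior odds.

    decisions[i][j] plays the role of P(w_j | x^i); prior defaults to 1/l.
    """
    n_classes = len(decisions[0])
    if prior is None:
        prior = 1.0 / n_classes
    prior_odds = prior / (1.0 - prior)
    fused = []
    for j in range(n_classes):
        odds = prior_odds
        for e in decisions:
            p = min(max(e[j], eps), 1.0 - eps)      # keep the ratio finite
            odds *= (p * (1.0 - prior)) / ((1.0 - p) * prior)
        fused.append(1.0 - 1.0 / (1.0 + odds))
    return fused


single = probabilistic_fusion([[0.7, 0.3]])
agreeing = probabilistic_fusion([[0.8, 0.2], [0.8, 0.2]])
```

A single decision vector is returned unchanged, while two agreeing vectors reinforce one another: evidence of 0.8 from two sources yields odds of 16 to 1. For more than two classes the fused values need not sum to one, since each class is treated against its own complement.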



3 Classification with the k-Nearest-Neighbour Rule

One of the most elegant and simplest classification techniques is the k-nearest-neighbour
rule [2]. The classifier searches for the k nearest neighbours of
an input vector $x \in \mathbb{R}^d$ among a set of m prototypes $\{x^1, \ldots, x^m\} \subset \mathbb{R}^d$. The k
nearest neighbours are calculated utilizing the $L_p$-norm

$$ d_j(x, x^j) = \|x - x^j\|_p = \Big( \sum_{i=1}^{d} |x_i - x_i^j|^p \Big)^{1/p} \qquad (17) $$

between x and $x^j$. The class which occurs most often among the k nearest
neighbours is the classification result.

Let l be the number of classes. To determine the membership of an input
vector x to each class, a fuzzy k-nearest-neighbour classifier is applied.
A fuzzy classifier $\mathcal{N}$ is a mapping $\mathcal{N} : \mathbb{R}^d \to [0,1]^l$, with output
$\mathcal{N}(x) \in \Delta$.

For each class $\omega = 1, \ldots, l$ let $m_\omega$ be the number of prototypes with class label
$\omega$. Then for input x and each class $\omega$ there is a sequence $(\tau_i^\omega)_{i=1}^{m_\omega}$ with
$\|x^{\tau_1^\omega} - x\|_p \le \cdots \le \|x^{\tau_{m_\omega}^\omega} - x\|_p$.

The k nearest neighbours ($k \le m_\omega$) of x to class $\omega$ are given by

$$ N_k^\omega(x) := \{ x^{\tau_1^\omega}, \ldots, x^{\tau_k^\omega} \} \qquad (18) $$

Let

$$ \delta_\omega(x) = \sum_{x^i \in N_k^\omega(x)} \cdots, \qquad \alpha > 0 \qquad (19) $$

be the support for the hypothesis that the true label of x is $\omega$. The parameter
$\alpha$ is used to determine how to grade low values of $\|x - x^i\|_p$.

After normalisation with

$$ \Delta_j(x) := \frac{\delta_j(x)}{\sum_{i=1}^{l} \delta_i(x)} \qquad (20) $$

we can restrict $\Delta_j(x)$ without loss of generality to the interval [0, 1] and call
the classifier outputs soft labels [8].

For input x the classification result $\mathcal{N}(x) = (\Delta_1(x), \ldots, \Delta_l(x))$ contains
the membership of x for each class. The parameter k determines the fuzziness
of the classification result. In our numerical experiments we set p to 1, $\alpha$ to 0.01,
f to 100 and k to 3.
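A fuzzy k-nearest-neighbour classifier of this general kind can be sketched as follows. The sketch makes one assumption that must be stated clearly: each of the k nearest prototypes of a class contributes 1 / (α + distance) to the support of that class. This inverse-distance summand is an illustrative choice and may differ from the support function of Eq. 19. Function names and the toy prototypes are hypothetical.

```python
def lp_distance(x, y, p=1):
    """L_p norm of x - y, as in Eq. 17."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def fuzzy_knn(x, prototypes, labels, n_classes, k=3, p=1, alpha=0.01):
    """Soft labels from the k nearest prototypes of each class.

    ASSUMPTION: each neighbour adds 1 / (alpha + distance) to the support
    of its class; this summand is illustrative, not taken from the paper.
    """
    supports = []
    for omega in range(n_classes):
        dists = sorted(lp_distance(x, proto, p)
                       for proto, lab in zip(prototypes, labels) if lab == omega)
        supports.append(sum(1.0 / (alpha + d) for d in dists[:k]))
    total = sum(supports)
    return [s / total for s in supports]      # normalisation as in Eq. 20


# Hypothetical 1-D prototypes: class 0 near the origin, class 1 far away.
prototypes = [[0.0], [1.0], [10.0], [11.0]]
labels = [0, 0, 1, 1]
soft = fuzzy_knn([0.5], prototypes, labels, n_classes=2, k=2)
```

The returned vector lies in Δ by construction: its entries are non-negative and sum to one, so it can be passed unchanged to either fusion mapping of Section 2.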



4 Application 

We present results achieved by testing the algorithms on a dataset which contains
sound patterns from 28 different cricket species. The dataset contains recordings
of 3 or 4 different individuals per species. The recordings are from Thailand (used
by Ingrisch [3,4]) and from Ecuador (used in the doctoral thesis by Nischk).
The sound patterns are stored in the standard WAV format (44.1 kHz sampling
frequency, 16-bit sampling accuracy).

The cricket songs consist of sequences of sound patterns (chirps). Based on 
these chirps (sequences of so-called pulses) the crickets may be classified [1,7].
Therefore we analyse the structure of single pulses in the cricket songs, and so 
the first processing step is the pulse detection. 

4.1 Pulse Detection 

The positions of the pulses are located using a modified version of the algorithm
of Rabiner and Sambur [?], which uses the signal's energy and two thresholds (see
Fig. 3b) to detect the onsets (start positions) and offsets (stop positions) of the signals.
A modified algorithm is applied to extract single pulses of the cricket songs [?].
These pulses are used to calculate the features.
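
As a rough illustration of energy-based detection with two thresholds, the following sketch marks a pulse wherever the short-time energy exceeds an upper threshold and then extends its borders outwards while the energy stays above a lower threshold. It is not the modified Rabiner-Sambur algorithm used by the authors; the frame length, the threshold values and the function names are assumptions made for the example.

```python
def short_time_energy(signal, frame_len):
    """Sum of squared samples over consecutive non-overlapping frames."""
    return [
        sum(s * s for s in signal[i:i + frame_len])
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]


def detect_pulses(energy, low, high):
    """Return (onset, offset) frame indices of detected pulses.

    A pulse must exceed `high` somewhere; its borders are then extended
    outwards for as long as the energy stays above `low`.
    """
    pulses = []
    i, n = 0, len(energy)
    while i < n:
        if energy[i] > high:
            start = i
            while start > 0 and energy[start - 1] > low:
                start -= 1
            end = i
            while end + 1 < n and energy[end + 1] > low:
                end += 1
            pulses.append((start, end))
            i = end + 1  # continue searching after this pulse
        else:
            i += 1
    return pulses


energy = [0, 1, 5, 9, 4, 1, 0, 0, 2, 8, 7, 2, 0]
pulses = detect_pulses(energy, low=1.5, high=6)
```

With these toy values two pulses are found, covering frames 2 to 4 and frames 8 to 11.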

4.2 Local Features 

The following three local features are extracted from the single pulses and then 
used as inputs for the classifiers: 

1. Pulse length: Let $n$ be the number of pulses. Furthermore let $\lambda_i$ be the
onset of the $i$-th pulse and $\mu_i$ be the offset of the $i$-th pulse. Then the length
$L_i$ is extracted by $L_i = \mu_i - \lambda_i$, $i = 1, \ldots, n - d$, where $d$ is the number of
dimensions used for the pulse distances.



Fig. 3. Waveform and energy of the species Noctitrella glabra (time window 1500 ms). 



2. Pulse distances: Let $\Delta = (\Delta_1, \ldots, \Delta_{n-1})$ be the distances between two
pulses calculated by

$\Delta_j = \ldots, \quad j \in \{1, \ldots, n-1\}$. (21)

Then the pulse distances are extracted using a $d$-tuple encoding scheme
producing $n - d$ data points $T_i \in \mathbb{R}^d$:

$T_i := (\Delta_i, \Delta_{i+1}, \ldots, \Delta_{i+d-1}) \in \mathbb{R}^d, \quad i = 1, \ldots, n - d$. (22)

3. Pulse frequency: The pulse frequency $F_i$, $i = 1, \ldots, n - d$, is extracted
from the spectra of the single pulses. The algorithm searches the frequency
band with the highest energy.
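
The pulse-length feature and the d-tuple encoding of Eq. (22) can be sketched as follows. Because the right-hand side of Eq. (21) is not legible, the sketch assumes that the distance between two pulses is measured from onset to onset; this choice and all function names are assumptions for illustration only.

```python
def pulse_lengths(onsets, offsets, d):
    """Lengths L_i = offset_i - onset_i for the first n - d pulses."""
    n = len(onsets)
    return [offsets[i] - onsets[i] for i in range(n - d)]


def pulse_distances(onsets):
    """ASSUMPTION: distance between consecutive pulses, onset to onset."""
    return [onsets[j + 1] - onsets[j] for j in range(len(onsets) - 1)]


def tuple_encode(distances, d):
    """Sliding d-tuples over the n - 1 distances, giving n - d data points."""
    n = len(distances) + 1
    return [tuple(distances[i:i + d]) for i in range(n - d)]


onsets = [0, 10, 25, 33, 50]
offsets = [4, 15, 29, 38, 55]
d = 2
lengths = pulse_lengths(onsets, offsets, d)
tuples = tuple_encode(pulse_distances(onsets), d)
```

For five pulses and d = 2 this yields three length values and three 2-tuples, so every data point T_i has a matching L_i.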

5 Results and Discussion 

Because of the limited data set of 108 sound recordings (4 records for 24 species
and 3 records for 4 species) we use cross-validation to evaluate
the classifiers. The training set has been used to design the classifiers, and the
test set for testing the performance of the classification task. The test is done
in a k-fold cross-validation test with k = 4 cycles using 2-3 records of each
species for training and 1 record of each species for the classification test. In
the numerical evaluation 25 different cross-validation tests have been made. The
training and test records are split randomly and are always recordings from
different individuals. Table 1 shows the cross-validation results for the single
features.

The best feature for the classification is the distances between pulses $T_i$.
Including other features in the classification procedure may enhance the
classification rate. Tables 2 and 3 show the classification rates of the previously
introduced classifier architectures for the three local features. For the DCT
architecture the fusion functions average fusion and symmetrical probabilistic fusion
(see Eq. (?)) are used only for the temporal fusion (see Eq. (?)), because
data fusion is the first fusion step.

The best classification rate (6.924 % error) for the described application 
is achieved with the DCT architecture and the average function for temporal 
Table 1. Classification performance of the single local features after temporal fusion
through average fusion.

feature                   error %
Pulse length (L_i)        82.068 ± 2.178
Pulse frequency (F_i)     72.879 ± 2.448
Time structure (T_i)      18.396 ± 3.086



Table 2. Classification results of locally fused features for the DCT architecture
dependent on the fusion function (AF_T: temporal fusion with the average function;
SPF_T: temporal fusion with the symmetrical probabilistic function).

architecture    AF_T             SPF_T
DCT             6.924 ± 1.655    7.091 ± 1.711



fusion (AF_T). For this architecture the symmetrical probabilistic fusion over
time (SPF_T) leads to almost the same result (7.091 %, see Table 2).

For decision fusion and temporal fusion with average fusion (AF_D/AF_T)
the CDT architecture is equivalent to the CTD architecture (classification error
8.595 %). For decision fusion with average fusion and temporal fusion with the
symmetrical probabilistic function (AF_D/SPF_T) the CDT architecture
outperforms the CTD architecture. The reason is that the symmetrical
probabilistic function is sensitive to errors in the probability estimations $P(\omega_j|x^i)$.
These probability estimations may be better if the classifier results are combined
through decision fusion (CDT architecture) before temporal fusion is applied
(see Table 3). The same effect is observed for the SPF_D/SPF_T fusion function
combination.

We observed that in our numerical experiments the CDT architecture performs
at least as well as the CTD architecture for every fusion function combination.

It seems that symmetrical probabilistic fusion is sensitive when used for temporal fusion,
because the product $P(e_1, \ldots, e_n)$ is equal to zero if just a single decision $e_i$ in the



Table 3. Classification results (error %) of locally fused features for the CDT and CTD
architecture dependent on the fusion function (AF: average fusion, see Eq. (?); SPF:
symmetrical probabilistic fusion, see Eq. (?)). The indices D and T indicate whether the
fusion function is used for decision fusion (D) or temporal fusion (T).

architecture    AF_D/AF_T        AF_D/SPF_T        SPF_D/AF_T       SPF_D/SPF_T
CDT             8.595 ± 1.776    8.409 ± 1.975     7.314 ± 1.656    7.073 ± 1.816
CTD             8.595 ± 1.776    15.296 ± 2.737    7.407 ± 1.792    14.424 ± 2.528
sequence is equal to zero (see Eq. (?)). But for decision fusion the symmetrical
probabilistic function outperforms average fusion in our application (see Table 3).
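
The sensitivity of a product-based fusion to a single zero decision can be illustrated with a small sketch. The equation of the symmetrical probabilistic fusion is not part of this excerpt, so a plain normalised product rule is used here as a stand-in; it shares the property discussed above but is not claimed to be the authors' exact function. All names and values are assumptions.

```python
def average_fusion(decisions):
    """Class-wise mean of a sequence of soft decision vectors."""
    n = len(decisions)
    return [sum(col) / n for col in zip(*decisions)]


def product_fusion(decisions):
    """Normalised class-wise product (stand-in for a probabilistic rule)."""
    prod = [1.0] * len(decisions[0])
    for dec in decisions:
        prod = [p * v for p, v in zip(prod, dec)]
    total = sum(prod)
    return [p / total for p in prod] if total > 0 else prod


# Four decisions over time for two classes; the third one assigns
# zero support to class 0 although the others favour it.
seq = [[0.7, 0.3], [0.8, 0.2], [0.0, 1.0], [0.9, 0.1]]
avg = average_fusion(seq)
prd = product_fusion(seq)
```

Average fusion still favours class 0, whereas the product collapses to zero for class 0 because of the single outlier in the sequence.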
Abstract. There are problems in pattern recognition where the output
of a system is a sequence of classes rather than a single class. A 
well-known example is handwritten sentence recognition. In order to 
make those problems amenable to classifier combination techniques, 
an algorithm for sequence alignment must be provided. The present 
paper describes such an algorithm. The algorithm extends an earlier 
method by including information about the location of each pattern 
in a sequence. The proposed approach is evaluated in the context 
of a system for handwritten sentence recognition. It is demonstrated 
through experiments that by the use of positional information the 
computationally expensive process of multiple sequence alignment can 
be significantly sped up without losing recognition accuracy.

Keywords: multiple classifier combination, multiple sequence align- 
ment, string edit distance, positional information, handwriting 
recognition. 



1 Introduction 

Multiple classifier combination has become a very active area of research [?]. The
motivation behind the activities in this area is based on the observation that 
classification errors can be often corrected if an ensemble of classifiers rather 
than a single classification method is used for a given task. However, most of 
the approaches to classifier combination aim at the situation where the target 
of each classifier, and the whole system, is just a single class. For this kind of 
problem, a multitude of combination techniques have been proposed, including 
product, sum, median, maximum and minimum of a posteriori probabilities and 
related quantities [2], as well as voting and trainable combiners.

But there are many pattern recognition problems where the desired result
is a sequence of classes, rather than just a single class. An example is
handwritten sentence recognition [?]. Here the goal of the recognizer is to produce a
sequence of words, or classes, that represent the handwritten text. A difficult 
problem in the recognition of handwritten sentences is the segmentation of a 
line of text into individual words. Such a segmentation can be performed in an 
explicit way prior to recognition [?], or it can be integrated into the recognizer
[?]. In either case, both recognition and segmentation are prone to errors.
Therefore, we can neither expect that the identity of a word delivered by a recognizer
is correct, nor that the correct number of words is reported. Consequently, given 
N sequences of words, each delivered by a different recognizer, an alignment of 
the N sequences needs to be done, before ’conventional’ classifier combination 
techniques, such as the ones mentioned above, can be applied. 

A well-known method for sequence alignment is based on the edit distance
of sequences [?]. The method reported in [?] can be used for the alignment of
two sequences. An extension to the case of $N > 2$ sequences is described in
[?]. This procedure has been used for the postprocessing of OCR results [?]. A
problem with the alignment procedure described in [?] is its high computational
complexity. Let $N$ denote the number of sequences and $n$ the maximum sequence
length. Then the algorithm described in [?] needs $O(n^N)$ time and space. In [?]
a suboptimal version of the procedure described in [2] was used, which explores
only part of the $N$-dimensional search space. However, a drawback is that the
portion of the search space that is actually considered has to be specified a
priori, i.e., it is independent of the actual input data. In the present paper,
another version of the algorithm described in [2] is proposed. It is distinguished
by the fact that it uses positional information about the words delivered by the
recognizers. If the words in an $N$-tuple under consideration differ significantly in
their spatial location in a line of handwritten text, they are not considered any 
longer as potential candidates for alignment. This leads to a quite substantial 
pruning of the search space. But in contrast with defining the part of the space 
that is actually explored in a fixed manner beforehand, the proposed method 
dynamically adapts itself to the most promising regions in the search space 
based on the actual input data. 

The present paper is organized as follows. In Section 2 the problem of
sequence alignment for multiple classifier combination is stated formally, and the
proposed solution is described. Then in Section 3 experimental results obtained
with the new method in a system for handwritten text recognition are presented.
Finally, the paper is summarized in Section 4 and conclusions are drawn.



2 Problem Statement and Proposed Solution 

Assume we have $N$ recognizers, $R_1, \ldots, R_N$, where recognizer $R_i$ yields a
sequence of words (or classes)

$s^i = (w_1^i, \ldots, w_{n_i}^i)$ (1)

as output; $i = 1, \ldots, N$. If the number of words output by each $R_i$ were
the same, i.e., $n_1 = n_2 = \cdots = n_N = n$, then 'traditional' classifier
combination techniques, such as the ones referenced in Section 1, could be directly
applied. That is, the most plausible word could be determined for each position
$j$ based on $(w_j^1, \ldots, w_j^N)$, $j = 1, \ldots, n$. In practice, however, due to segmentation
errors, we have to anticipate that all $n_1, \ldots, n_N$ may be different from each other.
Hence, before any of the combination methods mentioned in Section 1 can be



Fig. 1. Example of a text to be recognized. 

Table 1. Results for the text of Fig. 1 output by three different recognizers.

with lavish and suitably gaudy colour 
with lavish and suitably gaudy colour . 
with lavish and spite they gaudy about . 



Table 2. Alignment for the text of Fig. 1 produced by three different recognizers.

       1.     2.       3.    4.         5.     6.      7.       8.
a.)    with   lavish   and   suitably   ε      gaudy   colour   ε
b.)    with   lavish   and   suitably   ε      gaudy   colour   .
c.)    with   lavish   and   spite      they   gaudy   about    .



applied, an alignment between all $N$ sequences $s^1, \ldots, s^N$ has
to be performed.

As an example, Fig. 1 shows an image of a word sequence, for which the
recognition results delivered by three different recognizers are listed in Table 1.
(Notice that the results of the first and second classifier differ only in the period
at the end of the word sequence.) In Table 2 an optimal alignment of the three
sequences is shown. The symbol $\varepsilon$ denotes the empty word that is inserted in the
sequence of recognizer $R_i$ if $R_j$, $i \ne j$, outputs a word for which no corresponding
word is reported by $R_i$. By an optimal alignment we mean an alignment that
minimizes the total number of $\varepsilon$'s as well as the number of word substitutions
involved. It can be seen that for the first three positions a perfect alignment 
with no word substitutions or insertions of the empty word can be found. At the 
fourth and seventh position word substitutions have to be performed, while the 
empty word has to be inserted at the fifth position and at the end. 

Computing an optimal alignment between word sequences $s^1, \ldots, s^N$ is
equivalent to computing a sequence of words, $s$, that has, among all possible sequences
of words from the underlying dictionary $V$, the minimum average edit distance
to $s^1, \ldots, s^N$. Formally, given $s^1, \ldots, s^N$ we want to find a sequence of words, $s$,
that minimizes

$\sum_{i=1}^{N} d(s, s^i)$ (2)

where $d(s, s^i)$ is the edit distance between $s$ and $s^i$. A sequence of words
with this property is also known as a generalized median of the set $\{s^1, \ldots, s^N\}$.
Next, we review the algorithm for generalized median computation that was
described in [3]. It is an extension of the classical algorithm for string edit distance



¹ This example is based on the recognizers described in Section 3.
computation proposed in [?]. In order to model recognition and segmentation
errors, three types of edit operations are considered, namely, the deletion,
insertion, and substitution of a word. Let $w \to \varepsilon$ and $\varepsilon \to w$ denote the deletion and
insertion of word $w$, respectively, and $w \to w'$ the substitution of $w$ by $w'$. Each
of these edit operations is assigned a cost, which is a real non-negative number.
Let $c(w \to \varepsilon)$, $c(\varepsilon \to w)$ and $c(w \to w')$ denote the cost of edit operations $w \to \varepsilon$,
$\varepsilon \to w$ and $w \to w'$, respectively. In the remainder of this paper, we'll use the
following costs:

$c(w \to w') = \begin{cases} 0, & \text{if } w = w' \\ 1, & \text{otherwise} \end{cases}$ (3)

Notice that $w \to w'$ includes, as a special case, $w = \varepsilon$ or $w' = \varepsilon$. Hence the
cost of deleting or inserting word $w$ as well as the cost of substituting $w$ by $w'$,
where $w \ne w'$, is equal to one. (Note that edit operation $\varepsilon \to \varepsilon$ will be excluded
from our considerations.)

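To find an optimal alignment of sequences $s^1, \ldots, s^N$, i.e., to compute their
generalized median, for an $N$-tuple of words $(w_{j_1}^1, \ldots, w_{j_N}^N)$ the word $v \in V' =
V \cup \{\varepsilon\}$ causing the minimal edit costs needs to be determined. Formally, $v$ is
defined by means of the following equation:

$v = \delta(w_{j_1}^1, \ldots, w_{j_N}^N) = \min\bigl(c(v \to w_{j_1}^1) + c(v \to w_{j_2}^2) + \ldots + c(v \to w_{j_N}^N)\bigr)$. (4)

That is, the minimum is taken over all $v \in V'$; in Eq. (6), $\delta$ denotes the
resulting minimal cost.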



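Having defined function $\delta$ in this way, the optimal alignment of $N$ sequences
$s^1, \ldots, s^N$ can be computed by means of dynamic programming [2], similarly to
string edit distance computation [?], in an $N$-dimensional array. To simplify the
notation in this paper, the following algorithm addresses only the case $N = 3$.
But the generalization to arbitrary $N$ is straightforward.

initialization:

$d_{i,0,0} = i; \quad i = 0, 1, \ldots, n_1$
$d_{0,j,0} = j; \quad j = 0, 1, \ldots, n_2$ (5)
$d_{0,0,k} = k; \quad k = 0, 1, \ldots, n_3$

iteration:

$d_{i,j,k} = \min \left\{ \begin{array}{l}
d_{i-1,j-1,k-1} + \delta(w_i^1, w_j^2, w_k^3) \\
d_{i-1,j-1,k} + \delta(w_i^1, w_j^2, \varepsilon) \\
d_{i-1,j,k-1} + \delta(w_i^1, \varepsilon, w_k^3) \\
d_{i-1,j,k} + \delta(w_i^1, \varepsilon, \varepsilon) \\
d_{i,j-1,k-1} + \delta(\varepsilon, w_j^2, w_k^3) \\
d_{i,j-1,k} + \delta(\varepsilon, w_j^2, \varepsilon) \\
d_{i,j,k-1} + \delta(\varepsilon, \varepsilon, w_k^3)
\end{array} \right\}, \quad 1 \le i \le n_1, \; 1 \le j \le n_2, \; 1 \le k \le n_3$ (6)

end:

if $(i = n_1) \wedge (j = n_2) \wedge (k = n_3)$ (7)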
Table 3. Minimal edit distance and the corresponding path through the search space
for the example in Tabs. 1 and 2 and Fig. 1.

d_{0,0,0} = 0
d_{1,1,1} = 0 = d_{0,0,0} + δ(with, with, with)
d_{2,2,2} = 0 = d_{1,1,1} + δ(lavish, lavish, lavish)
d_{3,3,3} = 0 = d_{2,2,2} + δ(and, and, and)
d_{4,4,4} = 1 = d_{3,3,3} + δ(suitably, suitably, spite)
d_{4,4,5} = 2 = d_{4,4,4} + δ(ε, ε, they)
d_{5,5,6} = 2 = d_{4,4,5} + δ(gaudy, gaudy, gaudy)
d_{6,6,7} = 3 = d_{5,5,6} + δ(colour, colour, about)
d_{6,7,8} = 4 = d_{6,6,7} + δ(ε, ., .)



The path through the search space that leads from $d_{0,0,0}$ to $d_{n_1,n_2,n_3}$ with
minimum cost defines both the optimal alignment of $s^1, s^2, s^3$ and their
generalized median. As an example, in Table 3 the optimal path through the
three-dimensional search space for the example given in Table 1 is shown. At the
positions marked with an arrow, edit costs are caused. At these positions, word
substitutions occur or the empty word is inserted to achieve an optimal
alignment.
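
The dynamic programming scheme reviewed above can be sketched for an arbitrary number of sequences. The sketch uses the 0/1 costs of Eq. (3), for which the minimal cost of a word tuple equals the tuple size minus the count of its most frequent entry. As an assumption of this sketch, every cell of the array (including the boundary faces) is filled with the same recurrence; the names `delta` and `median_distance` are illustrative.

```python
from collections import Counter
from itertools import product

EPS = ""  # stands for the empty word


def delta(words):
    """Minimal cost over all v: with 0/1 costs the best v is the most
    frequent entry of the tuple (possibly the empty word)."""
    return len(words) - max(Counter(words).values())


def median_distance(seqs):
    """Minimal total edit cost of aligning all sequences in `seqs`."""
    n = len(seqs)
    lengths = [len(s) for s in seqs]
    # every non-zero 0/1 vector: which sequences advance by one word
    moves = [m for m in product((0, 1), repeat=n) if any(m)]
    d = {tuple([0] * n): 0}
    # lexicographic order guarantees predecessors are already computed
    for cell in product(*(range(length + 1) for length in lengths)):
        if not any(cell):
            continue
        best = None
        for m in moves:
            prev = tuple(c - step for c, step in zip(cell, m))
            if min(prev) < 0:
                continue
            words = tuple(seqs[r][cell[r] - 1] if m[r] else EPS
                          for r in range(n))
            cand = d[prev] + delta(words)
            if best is None or cand < best:
                best = cand
        d[cell] = best
    return d[tuple(lengths)]


a = "with lavish and suitably gaudy colour".split()
b = a + ["."]
c = "with lavish and spite they gaudy about .".split()
distance = median_distance([a, b, c])
```

For the three recognizer outputs of Table 1 the sketch returns a distance of 4, which agrees with the final value in Table 3.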

Obviously it is very unlikely that a word at the beginning of a sequence 
corresponds to a word at the end of another sequence. More generally, only 
words that occur at a similar position in a line of text are meaningful candidates 
for being matched to each other. Therefore positional information about the 
beginning and the end of the words can be used to reduce the search space of 
the previous algorithm. For each word $w_j^i$ the starting point $s_j^i$ and ending point
$e_j^i$ are considered. Based on this information several distance functions can be
defined. In our work three distance functions were used.

In the first distance function the overlap of the words of an $N$-tuple is
considered. If the words have a high degree of overlap, the distance is low. The overlap
region between $N$ words $w_{j_1}^1, \ldots, w_{j_N}^N$ that are being matched to each other is
computed by means of the following two values $\max_s$ and $\min_e$:

$\max_s = \max_{i=1,\ldots,N}(s_{j_i}^i)$ (8)

$\min_e = \min_{i=1,\ldots,N}(e_{j_i}^i)$ (9)



Then the distance function is computed by summing all parts of the words 
which do not overlap with all other words. This is done in the following way: 



$D_{overlap}(w_{j_1}^1, \ldots, w_{j_N}^N) = \sum_{i=1}^{N} \bigl[(\max_s - s_{j_i}^i) + (e_{j_i}^i - \min_e)\bigr]$. (10)

The second distance function considers the sum of the maximal distance 
between all starting points and the maximal distance between all ending points 
of the words. This can be expressed by the following formula: 
$D_{max-min}(w_{j_1}^1, \ldots, w_{j_N}^N) = \bigl[\max_s - \min_{i}(s_{j_i}^i)\bigr] + \bigl[\max_{i}(e_{j_i}^i) - \min_e\bigr]$. (11)

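In the third distance function the maximum among all distances between
starting points $s_{j_i}^i$ and ending points $e_{j_i}^i$ is computed. This is done as follows:

$D_{max-dist}(w_{j_1}^1, \ldots, w_{j_N}^N) = \max_{j,k}\bigl(|s^j - s^k|, |e^j - e^k|\bigr)$, (12)

where $s^j$ and $e^j$ abbreviate the starting and ending point of the word delivered
by recognizer $R_j$.
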



In the alignment algorithm, for each $N$-tuple of words the positional distance
is computed using one of the distances defined in Eqs. (10)-(12). If this distance is
larger than a predefined threshold $\theta$, the $N$-tuple of words $(w_{j_1}^1, \ldots, w_{j_N}^N)$ is
not considered as an appropriate candidate for alignment, i.e., the corresponding
d-value is not computed. By varying the threshold $\theta$, various degrees of search
space reduction can be achieved.
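
The three positional distances and the threshold test can be sketched as follows, with each word represented by a (start, end) pair. The form of the third distance follows the verbal description, since Eq. (12) is only partly legible, and tuples containing the empty word (which has no position) are not handled here. Function names and the example coordinates are assumptions.

```python
def d_overlap(spans):
    """Sum of all word parts lying outside the common overlap region."""
    max_s = max(s for s, _ in spans)
    min_e = min(e for _, e in spans)
    return sum((max_s - s) + (e - min_e) for s, e in spans)


def d_max_min(spans):
    """Largest start-point spread plus largest end-point spread."""
    starts = [s for s, _ in spans]
    ends = [e for _, e in spans]
    return (max(starts) - min(starts)) + (max(ends) - min(ends))


def d_max_dist(spans):
    """ASSUMPTION: maximum over all pairwise start and end distances."""
    starts = [s for s, _ in spans]
    ends = [e for _, e in spans]
    return max(max(starts) - min(starts), max(ends) - min(ends))


def is_candidate(spans, distance, theta):
    """Keep the word tuple only if its positional distance is within theta."""
    return distance(spans) <= theta


spans = [(0, 10), (2, 12), (1, 9)]  # one (start, end) pair per recognizer
```

For these three spans the distances are 7, 5 and 3, so a threshold of 4 would prune the tuple under the first two functions but keep it under the third.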

Having computed the optimal alignment of a set of word sequences, each 
delivered by a different classifier, one can either take the generalized median 
sequence directly as the recognition result [?], or apply 'conventional'
classifier combination techniques to each position in the aligned strings, using, for
example, voting or some combination function on the scores of the individual 
classifiers (see Introduction). 
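
Once the sequences are aligned, position-wise combination becomes simple. The sketch below applies plain voting to each column of an alignment and drops columns won by the empty word; ties are broken in favour of the word that appears first in the column, which is an arbitrary choice of this sketch rather than something prescribed by the text.

```python
from collections import Counter

EPS = "ε"  # marker for the empty word in an aligned sequence


def vote(aligned):
    """Majority vote per column of equally long aligned sequences."""
    result = []
    for column in zip(*aligned):
        word, _ = Counter(column).most_common(1)[0]
        if word != EPS:  # an empty-word winner produces no output word
            result.append(word)
    return result


aligned = [
    ["with", "lavish", "and", "suitably", EPS, "gaudy", "colour", EPS],
    ["with", "lavish", "and", "suitably", EPS, "gaudy", "colour", "."],
    ["with", "lavish", "and", "spite", "they", "gaudy", "about", "."],
]
combined = vote(aligned)
```

Applied to the alignment of Table 2, the vote yields "with lavish and suitably gaudy colour ." as the combined output.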



3 Experimental Evaluation 

In our earlier work we have developed three different recognition procedures
for handwritten sentence recognition, which will be called $R_1$, $R_2$ and $R_3$ in
the following. $R_1$ is a segmentation-free recognizer that takes a complete line
of handwritten text as input and yields a sequence of words as output. This
recognizer is based on hidden Markov models (HMMs). $R_2$ is similar to $R_1$, but
while $R_1$ processes lines of text in the normal left-to-right order, $R_2$ processes
them from right to left. As the Viterbi decoding algorithm used in our HMMs is
an approximate, non-exhaustive search procedure, it can be expected that $R_1$
and $R_2$ yield different results on an input sequence. $R_3$ is a segmentation-based
recognizer. It uses a number of heuristic rules to extract the individual words
from a line of handwritten text. Classification of these individual words is again
based on HMMs. As $R_3$ is significantly different from $R_1$ and $R_2$ regarding the
segmentation of a line of text into individual words, it can be expected that the
behavior of $R_3$ is different from $R_1$ and $R_2$. For a detailed description of these
three classifiers see [?].

The images of the handwritten texts used for the experiments are a subset 
(c03-*[a-f]) of the IAM database [?]. Altogether 59 text pages containing 541
text lines, with a total of 4523 word instances out of a vocabulary of 412 words
are used. For an example of a line of text from this database see Fig. 1. The data
are split into five subsets of approximately equal size. Each subset is used once 
for testing, while the others are taken for training of the HMM-based recognizers. 
Fig. 2. Error, search space and time behaviour for the distance function $D_{overlap}$.



On this data set a correct alignment rate of 63.64% was achieved on the word 
level. This means that 63.64% of all handwritten words were aligned in such a 
way that the words output by $R_1$, $R_2$, and $R_3$ were identical and correct. This
result was obtained by using the full search space for the alignment as described
in [?].

Note that the correct alignment rate of 63.64% can be regarded as a lower 
bound on the recognition rate of any classifier combination procedure that is 
based on the alignment method described in this paper. The lower bound will 
be produced by a combiner that outputs word w at position j if all three aligned 
sequences have the same word, w, at position j, and commits an error in all 
other cases, for example, the case where two or all three words at position j are 
different. 

To simplify the graphical representation, we'll not use the correct alignment
rate in the following, but the alignment error. This quantity is defined as
1 - correct alignment rate. Hence alignment error and correct alignment rate
carry the same information. In the experiment described above the alignment error
is 1 - 63.64% = 36.36%.

To test the potential of the three distance functions introduced in Section
2 for search space reduction, three experiments were conducted. In the first
experiment we have used the distance function $D_{overlap}$. For different thresholds,
the alignment error, the size of the explored search space and the computation
time¹ used by the algorithm were measured. The results obtained are shown

¹ Computation time was measured on a Sun Microsystems Ultra 5/270.
Fig. 3. Error, search space and time behaviour for the distance function $D_{max-min}$.
in Figure 2. It can be seen that if we start with a large value of the distance
threshold $\theta$ and decrease it (i.e. if we go from right to left along the x-axis in Fig.
2), the alignment error remains stable at the minimum of 36.36% until $\theta \approx 1000$,
while the search space and the computation time are getting smaller. By further
decreasing $\theta$, search space and computation time are further decreased, but now
the alignment error increases rapidly. Hence a threshold $\theta \approx 1000$ is optimal. It
leads to a significant speed-up of the algorithm without any loss in recognition
accuracy.

The second experiment uses the distance function $D_{max-min}$. The alignment
error, the size of the search space and the computation time are shown in Figure
3. This distance function behaves similarly to $D_{overlap}$. Again by reducing the
distance threshold $\theta$ to a certain value, the search space and computation time
can be reduced without affecting the alignment error.

Finally, for the third experiment the distance function $D_{max-dist}$ is used (see
Fig. 4), which again shows a behavior similar to the previous two experiments.
So with all three distance functions it is possible to reduce the search space, and
speed up the alignment algorithm, without losing recognition performance.

It is obvious from Figs. 2 to 4 that there is a trade-off between the alignment
error and the reduction of the search space. However, it is difficult to directly
compare the three measures $D_{overlap}$, $D_{max-min}$ and $D_{max-dist}$ to each other
based on the graphs shown in Figs. 2 to 4. But such a direct comparison is
possible if we plot the search space versus the alignment error, see Fig. 5. From
this figure it can be concluded that for large values of the threshold $\theta$, i.e. for the
Fig. 4. Error, search space and time behaviour for the distance function $D_{max-dist}$.



case where more than 40% of the search space is explored, all three distance
functions behave identically. However, for smaller values of $\theta$, i.e. for the case
where 40% or less of the search space is explored, $D_{max-min}$ and $D_{max-dist}$ still
exhibit similar performance, but they are both superior to $D_{overlap}$. Clearly
$D_{overlap}$ uses a larger portion of the search space in order to achieve the same
alignment error as $D_{max-min}$ and $D_{max-dist}$. Equivalently, it produces a lower
correct alignment rate when using the same amount of search space.

4 Conclusion 

There are many tasks in pattern recognition where the desired output is a se- 
quence of classes rather than just a single class. To make classifier combina- 
tion applicable to those tasks, an alignment of the output sequence produced 
by individual classifiers is needed. Known alignment procedures suffer from a 
high computational complexity. The main contribution of the present paper is a 
method for reduction of the search space by using positional information. Three 
different distance functions were tested to reduce the search space with this po- 
sitional information. One advantage of the proposed method is that the search 
space is not reduced in a fixed way but in a dynamic fashion, depending on the 
actual input, i.e. the word sequences to be aligned. 

In a number of experiments three recognizers for handwritten sentence recog- 
nition were used, each producing a word sequence with the positional information 
of the recognized words as output. The experiments show that over a wide range 
Fig. 5. Comparison of the three metrics in an error-search space graph.



of possible distance thresholds, the search space can be reduced without
introducing additional errors. Corresponding to the reduction of the search space, less
computation time is needed. Comparing the three distance functions for search
space reduction in a graph that shows search space vs. alignment error, it can be
seen that the functions $D_{max-min}$ and $D_{max-dist}$ behave better than $D_{overlap}$.
In the experiments described in this paper only three different recognizers 
were involved, which means that the proposed alignment procedure was applied 
to three strings only. Also the length of the considered strings is rather short. 
But it can be anticipated that the proposed procedure is applicable to problems 
involving a larger number of longer strings. Moreover, it can be expected that 
the computational savings achieved through the use of positional information 
become even more pronounced if longer and/or more sequences are involved. 
Another potential application of the method proposed in this paper is classifier 
combination in machine printed OCR, where each classifier outputs positional 
information about the individual characters or words on a page. 
Abstract. In many pattern recognition tasks, an approach based on combining
classifiers has shown a significant potential gain in comparison to the perform- 
ance of an individual best classifier. This improvement turned out to be subject 
to a sufficient level of diversity exhibited among classifiers, which in general 
can be assumed as a selective property of classifier subsets. Given a large num- 
ber of classifiers, an intelligent classifier selection process becomes a crucial is- 
sue of multiple classifier system design. In this paper, we have investigated 
three evolutionary optimization methods for the classifier selection task. Based 
on our previous studies of various diversity measures and their correlation with 
majority voting error we have adopted majority voting performance computed 
for the validation set directly as a fitness function guiding the search. To prevent
overfitting to the training data, we extracted a population of the best unique classifier
combinations and used them for second-stage majority voting. In this work we
intend to show empirically that efficient evolutionary-based selection
leads to results comparable to the absolute best found by exhaustive search.
Moreover, as we showed for selected datasets, introducing a second stage of combining
by majority voting has the potential both to further improve the
recognition rate and to increase the reliability of the combined outputs.



1 Introduction 

Research in pattern recognition shows that no individual method can be
shown to be the best for all classification tasks [1]. As a result, increasing effort is
being directed towards the development of fusion methods hoping to achieve im- 
proved and stable classification performance for a wider family of pattern recognition 
problems. Indeed, recently classifier fusion has been shown to outperform the tradi- 
tional, single-best classifier approach in many different applications [1-5]. In the 
safety critical systems, where the decisions taken are of crucial importance, any 
method offering improvement of the classification rate is invaluable even if it leads to 
higher complexity of the model. In such cases, the design of a reliable classification 
model should start from pooling of all available classifiers, ensuring that no potentially
supporting information is wasted. 

Given a large number of different classifiers it is always a question of how many 
and which ones to select for combining in order to achieve the highest performance of 
the fusion method. So far, a dominating approach was to pick several best classifiers, 
which commonly resulted in moderate improvement. As recently shown [4], picking 
several best classifiers does not necessarily lead to the best or sometimes even good 
solution. Further analysis revealed that in addition to individual performances, diver- 
sity among classifiers has to be taken into account for selection purposes [6,7]. Effec- 
tively, only reasonably diverse, and in particular negatively dependent classifiers could 
offer large improvement of the classification performance [8,9]. This fact imposes the 
necessity for selecting the most diverse classifiers, which are most likely to produce 
robust combined results. 

Many researchers have tried to apply different measures of diversity to select the
best team of classifiers [7,10,11]. As shown in [11], for majority voting, diversity
measures are particularly good at reducing system complexity, but selection based
on diversity measures appears to be rather imprecise and limited to lower-order
dependencies. Moreover, there are problems with consistent evaluation of diversity
when the number of classifiers involved varies.

An alternative to imprecise diversity-based selection is a direct search using the
performance of the combiner as the selection criterion. No imprecision is introduced by
diversity measures, and the performance of the selection process relies fully on the quality
of the search algorithms applied. However, this strategy requires operating on an
exponentially large and rough search space. Genetic algorithms as well as other
evolutionary algorithms have been shown to deal well and efficiently with large, rough
search spaces [12,13,16-18].

In this paper we assume selection from a large number of classifiers, with all available
information given in the form of hardened binary classifier outputs (correct/incorrect)
obtained from classification performed over a validation set. We used majority voting
(MV) as a combination method suited to these assumptions. Although very simple, MV has
quite often shown performance comparable to much more advanced techniques [14,15].
Moreover, as shown in [8,9], the theoretical potential for classification improvement
using MV is tremendous, especially with a large number of classifiers. Facing a large
and rough searching space, we applied well-known evolutionary algorithms: genetic
algorithm (GA) [16], tabu search (TS) [17] and population-based incremental learning
(PBIL) [18]. The selected algorithms represent quite different approaches to
evolutionary learning. We intend to use these algorithms to search efficiently for a
unique population of the best classifier combinations and to combine them further to
improve the reliability of the obtained solutions.

The remainder of the paper is organized as follows. Section 2 explains the problem of
classifier selection for optimal MV performance. In Section 3 we give a detailed
analysis of the presented searching algorithms (GA, TS and PBIL) and show the
implementation solutions and adjustments needed to reach a satisfactory selection
quality and compatibility with MV. Section 4 provides the results from the experiments
with real datasets. Finally, a summary and conclusions are given in Section 5.



2 Classifier Selection for Majority Voting

Even using simple majority voting, the classifier selection process is far from trivial.
The first problem is the representation of a single combination of classifiers, which
ought to be uniform throughout the searching process regardless of the number of
classifiers used. We chose the binary strings used in GAs, in which the bit in the j-th
position indicates inclusion (1) or exclusion (0) of the j-th classifier in the further
fusion process. Another problem is designing the fitness function. Based on our previous
studies in [11], we decided not to use diversity measures, which are only vaguely
correlated with MV performance. In our case, the searching algorithms account for system
simplification and there is no need to use imprecise diversity measures. Therefore, we
directly employed MV performance as the fitness function. For a smaller number of
classifiers, the quality of searching can always be inspected by comparing it with the
results of an exhaustive search. For larger systems, due to exponential complexity, the
exhaustive search very quickly becomes intractable and, given the very rough searching
space, the global optimum is rarely known.
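
As a rough worked count (an illustration added here, not a figure from the experiments, and assuming that only odd-sized teams are admissible, as the MV rule in this section requires), the number of candidate selections from M classifiers grows as follows:

```latex
\[
  \sum_{k \,\mathrm{odd}} \binom{M}{k} = 2^{M-1},
  \qquad M = 15:\ 2^{14} = 16384,
  \qquad M = 30:\ 2^{29} \approx 5.4 \times 10^{8}.
\]
```

Doubling the pool from 15 to 30 classifiers therefore multiplies the number of fitness evaluations of an exhaustive search by a factor of 2^15.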

Given a system of M classifiers $D = \{D_1, \dots, D_M\}$, let
$y_i = (y_{i1}, y_{i2}, \dots, y_{iM})$ denote the joint output of the system for the
i-th multidimensional input sample $x_i$, where $y_{ij} = y_j(x_i)$, $i = 1, \dots, N$,
$j = 1, \dots, M$, denotes the hardened output of the j-th classifier for data sample
$x_i$. In this work we assume the transformed binary outputs to be $y_{ij} = 1$ for
correct classification and $y_{ij} = 0$ for misclassification. Let
$v_k = (v_{k1}, \dots, v_{kM})$ represent a combination of classifiers, where
$v_{kj} \in \{0, 1\}$ indicates inclusion (1) or exclusion (0) of the j-th classifier in
the decision fusion. Given a combination $v_k$, the combined decision produced by the MV
combiner, $y_i^{MV}(v_k)$, can be obtained by:

$$y_i^{MV}(v_k) = \begin{cases} 0, & \sum_{j=1}^{M} y_{ij} v_{kj} \le \frac{1}{2} \sum_{j=1}^{M} v_{kj} \\ 1, & \sum_{j=1}^{M} y_{ij} v_{kj} > \frac{1}{2} \sum_{j=1}^{M} v_{kj} \end{cases} \qquad (1)$$

Given a validation set $X_V = \{x_1, x_2, \dots, x_N\}$, the selection can be reformulated
as a simple optimization process where the object of optimization is $v_k$ and the
fitness function, which we used in this study, is represented by the following formula:

$$y^{MV}(v_k) = \frac{1}{N} \sum_{i=1}^{N} y_i^{MV}(v_k) \qquad (2)$$
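
The decision rule (1) and the fitness function (2) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names and the toy validation matrix are hypothetical, and the matrix values are made up.

```python
def mv_decision(row, mask):
    """Eq. (1): 1 if a strict majority of the selected classifiers is correct."""
    votes = sum(y for y, v in zip(row, mask) if v)
    return 1 if 2 * votes > sum(mask) else 0


def mv_fitness(outputs, mask):
    """Eq. (2): fraction of validation samples the MV combiner gets right."""
    return sum(mv_decision(row, mask) for row in outputs) / len(outputs)


# Toy validation matrix (made-up values): 4 samples x 3 classifiers,
# 1 = correct classification, 0 = misclassification.
outputs = [
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
]

team_of_three = mv_fitness(outputs, [1, 1, 1])  # all three classifiers combined
first_alone = mv_fitness(outputs, [1, 0, 0])    # first classifier on its own
third_alone = mv_fitness(outputs, [0, 0, 1])    # third classifier on its own
```

In this toy matrix no single classifier is right on every sample, yet every sample has two correct votes, so the team of three scores higher than any of its members. This is the effect of negatively dependent errors discussed in the introduction.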

The MV definition shown above imposes further constraints. Namely, it enforces combining
only an odd number of classifiers. Otherwise one would have to implement a rejection
rule for the case of an equal number of contradictory votes, which brings additional
complexity to the system. Another problem is that, even assuming that the globally best
combination for the validation set is found, it may no longer be the optimal selection
for the testing set. In order to avoid this problem, instead of obtaining a single best
combination, we intend to extract a population of the best solutions and apply them all
in a second-stage combining process. In the experimental section we illustrate the
advantage of the second-stage combining, which resulted in improved classification
performance and reduced variability of that performance.



3 Searching Algorithms

The choice of searching algorithms to be used for classifier selection has been dictated
by several requirements. First, the algorithms should quickly and efficiently explore
the large searching spaces formed by the possible subsets of classifiers. Secondly, as
mentioned above, combining at the second stage is to be pursued in this work. Therefore,
rather than a single best solution, a population of the best combinations found should
be returned as the output of the searching process. Furthermore, the algorithms should
be rather sensitive to the searching criterion because, due to the commonly high
positive correlations among classifiers, the differences in combining performance are
expected to be small. For these reasons, we propose to use evolutionary algorithms
operating directly on combining performance. Three examples of such algorithms are
investigated here and specifically implemented for use with the majority voting combiner.



3.1 Genetic Algorithms 

Genetic algorithms (GA) have been used for a number of pattern recognition problems
[12,13,16]. There are several problems in adapting GA to classifier selection for
combining with MV. The major problem derives from the constraint of an odd number of
classifiers that has to be imposed. To keep the number of selected classifiers odd
throughout the searching process, the crossover and mutation operators have to be
specially designed. Mutation is rather easy to implement: assuming an odd number of
classifiers is set randomly in the initialization process, the odd number of selected
classifiers can be preserved by mutating a pair of bits or, in general, any even number
of bits. Crossover is much more difficult to control in that way. To avoid making the GA
too complex, crossover is performed traditionally and, after that, if the offspring
contains an even number of classifiers, one randomly selected bit is additionally
mutated to restore the odd number of 1's in the chromosome. To increase the exploration
ability of the GA we introduced an additional operator of 'pairwise exchange', which
simply swaps a random pair of bits within the chromosome, preserving the same number of
classifiers. In order to preserve the best combinations from generation to generation we
applied a specific selection rule, according to which the populations of parents and
offspring are put together and then a number of the best chromosomes equal to the size
of the population is selected for the next generation. Being aware of the potential
generalization problems, we have developed a simple diversifying operator. It forces all
chromosomes to be different from each other (unique) by mutating random bits until this
requirement is met. The whole algorithm can be defined as follows:

1. Initialize a random population of n chromosomes

2. Calculate fitness (MV performance) for each chromosome

3. Perform crossover and mutate single bits of offspring with an even number of 1's

4. Mutate all offspring at one or more randomly selected points

5. Apply one or more pairwise exchanges to each offspring

6. From all offspring and parents select the n best unique chromosomes for the next generation

7. If convergence is reached then finish, else go to step 2

Although this particular version of GA represents a hill-climbing algorithm, multiple
mutation and pairwise exchange, together with the diversifying operator, substantially
extend the exploration abilities of the algorithm. The convergence condition can be
associated with the case when no change in the mean MV error is observed for an
arbitrarily large number of generations. Preliminary experiments with real
classification datasets confirmed the superiority of the presented version of GA over
its standard definition and highlighted the importance of the diversifying operator for
the classifier selection process.
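
The seven steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: all names and default parameters are hypothetical, the run stops after a fixed number of generations instead of testing convergence, and uniqueness is enforced by keeping candidates in a set (duplicates are discarded) rather than by mutating duplicates until they differ. The validation matrix is randomly generated toy data.

```python
import random


def mv_fitness(outputs, mask):
    """MV performance: share of samples with a strict majority of correct votes."""
    size = sum(mask)
    hits = sum(2 * sum(y for y, v in zip(row, mask) if v) > size for row in outputs)
    return hits / len(outputs)


def flip(bits, rng):
    """Flip one randomly chosen bit in place."""
    bits[rng.randrange(len(bits))] ^= 1


def make_odd(bits, rng):
    """Repair step: flip one random bit if the number of 1's is even."""
    if sum(bits) % 2 == 0:
        flip(bits, rng)
    return bits


def ga_select(outputs, pop_size=10, generations=30, seed=0):
    rng = random.Random(seed)
    m = len(outputs[0])

    def fitness(chromosome):
        return mv_fitness(outputs, chromosome)

    pop = set()
    while len(pop) < pop_size:                        # step 1: unique, odd-sized
        pop.add(tuple(make_odd([rng.randint(0, 1) for _ in range(m)], rng)))
    for _ in range(generations):
        parents = sorted(pop)
        candidates = set(pop)                         # parents compete with offspring
        for _ in range(pop_size):
            p, q = rng.sample(parents, 2)
            cut = rng.randrange(1, m)                 # step 3: one-point crossover
            child = make_odd(list(p[:cut] + q[cut:]), rng)
            flip(child, rng)                          # step 4: two flips keep the
            flip(child, rng)                          #         number of 1's odd
            a, b = rng.sample(range(m), 2)            # step 5: pairwise exchange
            child[a], child[b] = child[b], child[a]
            candidates.add(tuple(child))
        # steps 2 and 6: rank everything and keep the n best unique chromosomes
        pop = set(sorted(candidates, key=fitness, reverse=True)[:pop_size])
    return sorted(pop, key=fitness, reverse=True)


# Hypothetical validation matrix: 200 samples x 7 classifiers, each entry
# independently correct with probability 0.7.
data_rng = random.Random(1)
outputs = [[int(data_rng.random() < 0.7) for _ in range(7)] for _ in range(200)]
best = ga_select(outputs)
```

Because parents are carried into the candidate set, the best fitness in the population can never decrease from one generation to the next, which is the elitist selection rule described in the text.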



3.2 Tabu Search 

Tabu search (TS) in its standard form is not a population-based algorithm, yet it shares
some similarities with GAs, particularly in the encoding of the problem [17]. Instead of
a population, it uses only a single chromosome, mutated randomly at each step. Because
of this there can be no crossover, and the only genetic change is provided by mutation.
This strongly limits the ability of the algorithm to jump into different regions of the
searching space. Moreover, it represents a hill-climbing algorithm, which reaches
convergence much faster than a typical GA but, on the other hand, may not find a global
optimum, as it may simply be unreachable from the initial conditions. Effectively, tabu
search in its original version quite easily gets trapped in local optima. To prevent
such effects we applied multiple consecutive mutations together with 'pairwise exchange'
before the fitness is examined. As in GA, we keep the population of unique best
chromosomes found during the process. As for the previous algorithm, the convergence
condition is satisfied if the pool of the k best solutions has not changed for a fixed
number of generations. The presented version of the TS algorithm can be described in the
following steps:

1. Create a single random chromosome

2. Mutate the chromosome at one or more randomly selected points

3. Apply one or more pairwise exchanges

4. Test the fitness of the new chromosome: if it is fitter, then the changes are accepted

5. Store the new chromosome if it is among the k unique best solutions found so far

6. If convergence is reached then finish, else go to step 2
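
A minimal sketch of the six steps above is given below. It follows the listed steps literally, so it contains no explicit tabu list. The names, the fixed step budget used in place of a convergence test, and the choice of flipping exactly two bits per step (to keep the number of selected classifiers odd) are assumptions made for illustration, and the validation matrix is random toy data.

```python
import random


def mv_fitness(outputs, mask):
    """MV performance: share of samples with a strict majority of correct votes."""
    size = sum(mask)
    hits = sum(2 * sum(y for y, v in zip(row, mask) if v) > size for row in outputs)
    return hits / len(outputs)


def ts_select(outputs, k=5, steps=200, seed=0):
    rng = random.Random(seed)
    m = len(outputs[0])
    current = [rng.randint(0, 1) for _ in range(m)]   # step 1
    if sum(current) % 2 == 0:                         # start from an odd-sized team
        current[rng.randrange(m)] ^= 1
    current_fit = mv_fitness(outputs, current)
    pool = {tuple(current): current_fit}              # k best unique solutions so far
    for _ in range(steps):
        cand = list(current)
        for _ in range(2):                            # step 2: flip a pair of bits
            cand[rng.randrange(m)] ^= 1
        a, b = rng.sample(range(m), 2)                # step 3: pairwise exchange
        cand[a], cand[b] = cand[b], cand[a]
        fit = mv_fitness(outputs, cand)
        if fit > current_fit:                         # step 4: accept improvements
            current, current_fit = cand, fit
        pool[tuple(cand)] = fit                       # step 5: update the pool
        if len(pool) > k:
            del pool[min(pool, key=pool.get)]         # drop the worst entry
    return sorted(pool.items(), key=lambda item: item[1], reverse=True)


# Hypothetical validation matrix: 200 samples x 7 classifiers.
data_rng = random.Random(1)
outputs = [[int(data_rng.random() < 0.7) for _ in range(7)] for _ in range(200)]
pool = ts_select(outputs)
```

The returned list holds at most k pairs of (chromosome, fitness), best first, which is the population passed on to second-stage combining.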

3.3 Population-Based Incremental Learning 

Due to the lack of a crossover operator, even after many adjustments tabu search
partially loses the ability to explore the whole searching space. It is possible to
regain the ability to reach most points of the searching space while keeping the
convergence property at a satisfactory level. The algorithm offering these properties is
called population-based incremental learning (PBIL) [18]. It also uses a population of
chromosomes, sampled from a special probability vector, which is updated at each step
according to the fittest chromosomes.

The update process of the probability vector is performed according to a standard
supervised learning method. Given a probability vector $p = (p_1, p_2, \dots, p_M)$ and
a population of chromosomes $G = (v_1, v_2, \dots, v_C)$, where
$v_k = (v_{k1}, \dots, v_{kM})$, each probability bit is updated as in the following
expression:

$$p_j^{new} = p_j^{old} + \Delta p_j, \qquad \Delta p_j = \eta \left[ \Big( \sum_{k=1}^{C} v_{kj} \Big) / C - p_j^{old} \right] \qquad (3)$$

where $k = 1, \dots, C$ refers to the C fittest chromosomes found and $\eta$ controls the
magnitude of the update. The number of best chromosomes taken to update the probability
vector, together with the magnitude factor $\eta$, controls the balance between the speed
of reaching convergence and the ability to explore the whole search space. Convergence
is reached if the probability vector contains only integer values: 0 or 1. In such a
case $p$ becomes the best combination of classifiers. As we are rather in favor of
obtaining a population of the best solutions, they are extracted and stored during the
process, preserving the diversity rule as in the previous algorithms. The PBIL algorithm
can be described in the following steps:

1. Create probability vector of the same length as the required chromosome and initialize it 
with values of 0.5 at each bit 

2. Sample a number of chromosomes according to the probability vector 

3. Update the probability vector by increasing probabilities in positions where the
fittest chromosomes had 1's

4. Update the pool of k best unique solutions 

5. If all elements in probability vector are 0 or 1 then finish, else go to step 2 

Although PBIL algorithm does not use any genetic operators observed in GA, it 
contains a specific mechanism that allows exploiting beneficial information through 
generations, and thus preserves the stochastic elements of evolutionary algorithms. 
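
The PBIL loop can be sketched as below. This is an illustration under assumptions, not the authors' implementation: the function names and default parameters are hypothetical, sampled chromosomes with an even number of 1's are repaired by one random bit flip (the text does not say how the odd-number constraint is handled in PBIL), and an iteration cap is added because with a learning rate below 1 the probabilities may only approach 0 or 1 without reaching them.

```python
import random


def mv_fitness(outputs, mask):
    """MV performance: share of samples with a strict majority of correct votes."""
    size = sum(mask)
    hits = sum(2 * sum(y for y, v in zip(row, mask) if v) > size for row in outputs)
    return hits / len(outputs)


def update_probabilities(p, fittest, eta):
    """Eq. (3): move each p_j towards the mean of bit j over the fittest chromosomes."""
    c = len(fittest)
    return [pj + eta * (sum(v[j] for v in fittest) / c - pj) for j, pj in enumerate(p)]


def pbil_select(outputs, pop_size=20, n_fittest=5, eta=0.2, k=5, max_iter=50, seed=0):
    rng = random.Random(seed)
    m = len(outputs[0])
    p = [0.5] * m                                      # step 1
    pool = {}
    for _ in range(max_iter):
        sample = set()
        for _ in range(pop_size):                      # step 2: sample chromosomes
            v = [int(rng.random() < pj) for pj in p]
            if sum(v) % 2 == 0:                        # assumed repair to odd size
                v[rng.randrange(m)] ^= 1
            sample.add(tuple(v))
        ranked = sorted(sample, key=lambda v: mv_fitness(outputs, v), reverse=True)
        p = update_probabilities(p, ranked[:n_fittest], eta)   # step 3
        for v in ranked[:k]:                           # step 4: pool of k best
            pool[v] = mv_fitness(outputs, v)
        pool = dict(sorted(pool.items(), key=lambda item: item[1], reverse=True)[:k])
        if all(pj in (0.0, 1.0) for pj in p):          # step 5
            break
    return p, pool


# Hypothetical validation matrix: 200 samples x 7 classifiers.
data_rng = random.Random(1)
outputs = [[int(data_rng.random() < 0.7) for _ in range(7)] for _ in range(200)]
p, pool = pbil_select(outputs)
one_step = update_probabilities([0.5, 0.5], [(1, 0), (1, 1)], 0.5)
```

In the one-step example, both fittest chromosomes select the first classifier, so its probability moves halfway from 0.5 towards 1, while the second probability stays at 0.5 because the fittest chromosomes disagree on that bit.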



4 Experiments 

The experiments have been organized in two groups. In the first part the presented
algorithms have been examined for three realistic datasets¹ from the UCI Repository² and
compared against simple alternative strategies: exhaustive search (ES), the single-best
classifier (SB) and a random search (RS). Selection was performed from a set of 15
different classifiers available from PRTOOLS 3.1³. In the second part of the experiments
we investigated the possibility of combining at the second stage by combining MV outputs
from the selections found as best at the first level.

In all experiments, we used the same parameters of the algorithms, for which preliminary
experiments showed the best results. Both PBIL and GA used 50 chromosomes in the
population. In TS and GA single-bit mutation has been applied together with a single
'pairwise exchange'. The learning rate for PBIL was set to $\eta = 1$. The best
validation combinations have also been examined on a testing set to evaluate the
generalization ability. To be able to compare the algorithms in terms of efficiency, in
all experiments the algorithms finished the run after examining a fixed number of
chromosomes, which was used instead of specifying convergence conditions.

¹ Datasets: Iris - recognition of the types of iris plant: 150 samples, 4 features, 3 classes;
Cancer - cancer diagnosis: 569 samples, 30 features, 2 classes; Diabetes - diabetes diagnosis:
768 samples, 8 features, 2 classes
² University of California Repository of Machine Learning Databases and Domain Theories,
available free at: ftp.ics.uci.edu/pub/machine-learning-databases
³ Pattern Recognition Toolbox for Matlab 5.0+, implemented by R.P.W. Duin, available free
at: ftp://ftp.ph.tn.tudelft.nl/pub/bob/prtools

A pool of M = 15 different classifiers has been applied to 3 datasets from the UCI
Repository. All the datasets have been split into equally sized training, validation and
testing sets. The trained classifiers were then applied to the validation and testing
sets. To estimate the true performances reliably, we repeated this process for many
random splits of the dataset, until we obtained binary matrices with N = 5000 rows
containing classification results separately for the validation and testing sets. The
searching algorithms have been applied to the validation matrix. The searching results
for the first-stage MV combining are shown in Table 1. For all presented datasets the
performances of the best selections found by the proposed searching algorithms were
better than those given by the quick SB selection and were very close to the attainable
bounds determined by the ES. The time of searching was, however, substantially reduced
in comparison with ES. For a larger number of classifiers ES becomes intractable, whilst
the searching time for the presented searching algorithms increases slowly. Moreover,
the balance between searching precision and the time of searching is adjustable and can
be controlled by the search method parameters.

Table 1. MV performance of the best selection (BV) and the average over the 50 best
selections (BV50) found by the searching algorithms from a validation matrix (5000 x 15)
obtained from classification of the Iris, Cancer and Diabetes datasets by 15 different
classifiers. The last two rows contain the testing matrix (5000 x 15) results, T(BV) and
T(BV50), for the same selections. The time of searching corresponds to the time of
checking 1000 different selections by each algorithm



IRIS            SB       ES       RS       TS       PBIL     GA
Time [s]        -        219.3    12.69    11.54    34.77    22.00
BV [%]          97.22    97.82    97.62    97.82    97.82    97.82
BV50 [%]        -        97.66    97.14    97.62    97.65    97.62
T(BV) [%]       97.02    97.48    97.46    97.48
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T(BV50) [%] 
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97.38 
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97.37 
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23.67 
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96.93 


96.99 


96.93 
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96.83 
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T(BV) [%] 
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37.68 
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76.57 
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76.57 


76.57 
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- 


76.50 


76.17 


76.45 


76.48 


76.46 


T(BV) [%]       76.77    77.11    76.78    76.86    76.98    77.11
T(BV50) [%]     -        77.03    76.72    77.01    77.05    77.06

[Figure panel: Cancer dataset: TS vs. SB]

Fig. 1. Performance of the second-stage MV combiner compared against the mean MV
performance of the best 50 validation combinations and the single best classifier,
supported by the statistics of variability across different random splits of the
datasets. The graphs correspond to the datasets from Table 1 and relate to the testing
set performance. The shaded area bounded by dashed lines, together with the dotted line
in the middle, represents the SB confidence intervals and the mean MV performance of the
classifier selected by the SB strategy, respectively. The grey solid line shows the mean
MV performance of the best 50 validation selections with their confidence intervals. The
black solid line represents the MV performance of the second-stage MV combining, shown
as a function of the number of the best validation combinations, with corresponding
confidence intervals. All the confidence intervals have been obtained by calculating the
means over different splits and taking 3 times the standard deviation.



4.1 Experiment 2

In this experiment, we looked at the generalization ability of the analyzed selection
algorithms. For that purpose we examined the idea of introducing a second stage of
combining and its implications for the variability of the obtained results. We prepared
statistics of the MV performances of the best selections obtained over consecutive
splits of the examined datasets. Calculating the means and standard deviations of MV
performance across different splits allowed us to estimate the reliability of the
selected models and compare it against the SB approach. The results for the 50 best
combinations of classifiers (represented by thick gray lines) and SB (represented by
dotted black lines) are illustrated in Fig. 1. It can be seen that relying only on the
best selection found for the validation set is in general risky. This is due to the
generalization dilemma, which is especially evident for a small amount of training data.
A better and more reliable strategy turned out to be taking the MV outputs from a number
of the best validation selections and obtaining a final decision by second-stage
majority voting. The performance of the second-stage MV combiner is shown by thick black
lines in Fig. 1. The plots show a slight improvement in comparison to any individual
combination and also indicate that this strategy is much more reliable and stable with
respect to the number of selections taken. The reliability improvement in comparison to
the SB results stems from the decreased variance brought about by aggregation of outputs.
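
Second-stage combining can be sketched as a majority vote over first-stage majority votes. The sketch below is an illustration only: it works on the same correct/incorrect encoding used throughout this paper, the function names are hypothetical, and the sample and the three selections are made-up values. An odd number of selections is assumed so that the second-stage vote cannot tie.

```python
def mv_decision(row, mask):
    """First stage: 1 if a strict majority of the selected classifiers is correct."""
    votes = sum(y for y, v in zip(row, mask) if v)
    return 1 if 2 * votes > sum(mask) else 0


def second_stage(row, selections):
    """Second stage: majority vote over the MV outputs of several selections."""
    first = [mv_decision(row, mask) for mask in selections]
    return 1 if 2 * sum(first) > len(first) else 0


# Three hypothetical "best" selections out of 5 classifiers (odd-sized teams).
selections = [
    [1, 1, 1, 0, 0],
    [0, 1, 0, 1, 1],
    [1, 0, 0, 0, 0],
]

sample_a = [1, 0, 1, 0, 1]   # first-stage outputs 1, 0, 1 -> second stage 1
sample_b = [0, 0, 1, 0, 1]   # first-stage outputs 0, 0, 0 -> second stage 0
result_a = second_stage(sample_a, selections)
result_b = second_stage(sample_b, selections)
```

For sample_a the second selection fails on its own, but the other two outvote it, which is the variance-reducing effect of aggregation described above.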



5 Conclusions 

In this paper, we studied the applicability of three evolutionary optimization
techniques to the problem of classifier selection for combining by majority voting
operating on binary classification outputs. Introducing a binary-string representation
of classifier combinations, we proposed specific implementations of the genetic
algorithm, tabu search and population-based incremental learning, applied to the
constrained majority voting rule accepting only an odd number of classifiers. Facing a
huge and rough searching space, we directly assigned majority voting performance as the
fitness function and put the main effort into developing searching algorithms with high
exploration capabilities that simultaneously work fast enough to be applicable to a
large number of classifiers.

Comparing the efficiency of searching with an exhaustive search, we obtained mostly the
same best selections while substantially reducing the time of searching. For all
experiments we recorded an improvement in majority voting performance of the best
selections found in comparison with the simple single-best selection strategy. Moreover,
due to the aggregation applied, we observed increased reliability of the best
selections, evident in the form of reduced variance of the majority voting performance
across different splits of the datasets. Nevertheless, the best validation selection is
not necessarily the best for the testing set, and the same holds for any individual
selection found among the best solutions. To avoid this risk we applied second-stage
combining, applying majority voting to the MV outputs of the best solutions from the
first stage. This strategy turned out to be successful and produced results slightly
better than individual selections and, more importantly, also improved the reliability
and stability of the output performance. These results allow choosing an arbitrarily
large number of the best selections for a second-stage fusion without risking a dramatic
loss of generalization ability, while preserving the generally good performance of the
system.



Abstract. Support vector machines (SVM) are learning algorithms derived from statistical
learning theory. The SVM approach was originally developed for binary classification
problems. In this paper SVM architectures for multi-class classification problems are
discussed; in particular, we consider binary trees of SVMs to solve the multi-class
pattern recognition problem. Numerical results for different classifiers on a benchmark
data set of handwritten digits are presented.



1 Introduction 

Statistical learning theory developed by Vladimir Vapnik formalizes the task of learning
from examples and describes it as a problem of statistics with finite sample size.
Originally, the SVM approach was developed for two-class or binary classification. The
N-class classification problem is defined as follows: Given a set of M training vectors
$(x^\mu, y^\mu)$, $\mu = 1, \dots, M$, with input vector $x^\mu \in \mathbb{R}^n$ and
with $y^\mu \in \{1, \dots, N\}$ as the class label of input $x^\mu$, find a decision
function $F: \mathbb{R}^n \to \{1, \dots, N\}$ mapping an input $x$ to a class label $y$.
Multi-class classification problems (where the number of classes N is larger than 2) are
often solved using voting schemes based on the combination of binary decision functions.
One approach is constructing N binary classifiers (e.g. an SVM network), one for each
class, together with a maximum detection across the classifier outputs to classify an
input vector $x$. This one-against-rest strategy is widely used in the pattern
recognition literature. Another classification scheme is the one-against-one strategy,
where $\binom{N}{2}$ binary classifiers are constructed, separating each pair of
classes, together with a majority voting scheme to classify the input vectors. A
different approach to solving an N-class pattern recognition problem is to build a
hierarchy or tree of binary classifiers. Each node of the graph is a classifier
performing a predefined classification subtask. In this procedure the hierarchy of
subtasks has to be determined before the classifiers are trained.

2 Support Vector Machines 

In this section we briefly review the basic ideas of support vector learning and present
four multi-class classification techniques which may be applied to SVMs.
SVMs were initially developed to classify data points of linearly separable data sets.
In this case a training set consisting of M examples $(x^\mu, y^\mu)$,
$x^\mu \in \mathbb{R}^n$ and $y^\mu \in \{-1, 1\}$, can be divided up into two sets by a
separating hyperplane. Such a hyperplane is determined by a weight vector
$b \in \mathbb{R}^n$ and a bias or threshold $\theta \in \mathbb{R}$ satisfying the
separating constraints

$$y^\mu \left( \langle x^\mu, b \rangle + \theta \right) \ge 1, \qquad \mu = 1, \dots, M.$$




(a) Optimal separating hyperplane with a large margin. (b) Separating hyperplane with a
smaller margin.

Fig. 1. Binary classification problem. The examples of the two different classes are
linearly separable.



The distance between the separating hyperplane and the closest data points of the
training set is called the margin, see Figure 1. The separating hyperplane with maximal
margin is unique and can be expressed by a linear combination of those training examples
(so-called support vectors) lying exactly at the margin. It has the form

$$H(x) = \sum_{\mu=1}^{M} \alpha_\mu^* y^\mu \langle x, x^\mu \rangle + \alpha_0^*.$$

Here $(\alpha_1^*, \dots, \alpha_M^*)$ is the solution optimizing the functional

$$Q(\alpha) = \sum_{\mu=1}^{M} \alpha_\mu - \frac{1}{2} \sum_{\mu,\nu=1}^{M} \alpha_\mu \alpha_\nu y^\mu y^\nu \langle x^\mu, x^\nu \rangle$$

subject to the constraints $\alpha_\mu \ge 0$ for all $\mu = 1, \dots, M$ and
$\sum_{\mu=1}^{M} \alpha_\mu y^\mu = 0$.

Then a training vector $x^\mu$ is a support vector if the corresponding coefficient
$\alpha_\mu^* > 0$. Then it is $b = \sum_{\mu=1}^{M} \alpha_\mu^* y^\mu x^\mu$ and the
bias $\alpha_0^*$ is determined by a single support vector $(x^s, y^s)$:
$\alpha_0^* = y^s - \langle x^s, b \rangle$.

The SVM approach can be extended to the nonseparable situation and to the regression
problem. In most applications (regression or pattern recognition problems) linear
solutions are insufficient, so it is common to define an appropriate set of nonlinear
mappings $g := (g_1, g_2, \dots)$, transforming the input vectors $x^\mu$ into a vector
$g(x^\mu)$ which is an element of a new feature space $\mathcal{H}$. Then the separating
hyperplane can be constructed in the feature space $\mathcal{H}$. Provided
$\mathcal{H}$ is a Hilbert space, the explicit mapping $g(x)$ does not need to be known,
since it can be implicitly defined by a kernel function
$K(x, x^\mu) = \langle g(x), g(x^\mu) \rangle$ representing the inner product of the
feature space. Using a kernel function K satisfying the condition of Mercer's theorem,
the separating hyperplane is given by

$$H(x) = \sum_{\mu=1}^{M} \alpha_\mu y^\mu K(x, x^\mu) + \alpha_0.$$

The coefficients $\alpha_\mu$ can be found by solving the optimization problem

$$Q(\alpha) = \sum_{\mu=1}^{M} \alpha_\mu - \frac{1}{2} \sum_{\mu,\nu=1}^{M} \alpha_\mu \alpha_\nu y^\mu y^\nu K(x^\mu, x^\nu)$$

subject to the constraints $0 \le \alpha_\mu \le C$ for all $\mu = 1, \dots, M$ and
$\sum_{\mu=1}^{M} \alpha_\mu y^\mu = 0$, where C is a predefined positive number. An
important kernel function satisfying Mercer's condition is the Gaussian kernel function
(also used in this paper)

$$K(x, y) = \exp\left( - \frac{\| x - y \|_2^2}{2 \sigma^2} \right).$$
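
To make the kernel form of the decision function concrete, the following sketch evaluates H(x) with a Gaussian kernel. It is an illustration only: the function names are hypothetical, and the two support vectors and their coefficients are hand-picked toy values, not the result of solving the optimization problem above.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2 * sigma ** 2))


def decision_value(x, support, alphas, labels, alpha0, sigma=1.0):
    """H(x) = sum_mu alpha_mu * y^mu * K(x, x^mu) + alpha_0."""
    return sum(a * y * gaussian_kernel(x, s, sigma)
               for a, y, s in zip(alphas, labels, support)) + alpha0


# Hand-picked toy values (assumed, not trained): one support vector per class.
support = [(0.0, 0.0), (2.0, 0.0)]
labels = [-1, 1]
alphas = [1.0, 1.0]   # satisfies sum_mu alpha_mu * y^mu = 0
alpha0 = 0.0

near_negative = decision_value((0.0, 0.0), support, alphas, labels, alpha0)
near_positive = decision_value((2.0, 0.0), support, alphas, labels, alpha0)
midpoint = decision_value((1.0, 0.0), support, alphas, labels, alpha0)
```

The sign of H(x) gives the predicted class. By symmetry the midpoint between the two support vectors lies on the decision boundary, where H(x) = 0.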

In many real-world applications, e.g. speech recognition or optical character
recognition, a multi-class pattern recognition problem has to be solved. The SVM
classifier is a binary classifier. Various approaches have been developed in order to
deal with multi-class classification problems. The following strategies can be applied
to build N-class classifiers utilizing binary SVM classifiers.

One-against-rest classifiers. In this method N different classifiers are constructed,
one classifier for each class. Here the l-th classifier is trained on the whole training
data set in order to classify the members of class l against the rest. For this, the
training examples have to be re-labeled: members of the l-th class are labeled 1;
members of the other classes -1. In the classification phase the classifier with the
maximal output defines the estimated class label of the current input vector.

One-against-one classifiers. For each possible pair of classes a binary classifier is
calculated. Each classifier is trained on a subset of the training set containing only
training examples of the two involved classes. As for the one-against-rest strategy, the
training sets have to be re-labeled. All N(N-1)/2 classifiers are combined through a
majority voting scheme to estimate the final classification. Here the class with the
maximal number of votes among all N(N-1)/2 classifiers is the estimate.

Hierarchies/trees of binary SVM classifiers. Here the multi-class classification problem
is decomposed into a series of binary classification sub-problems organised in a
hierarchical scheme; see Figure 2. We discuss this approach in the next section.
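
The recall phases of the first two strategies, one-against-rest and one-against-one, can be sketched as follows. The binary classifiers themselves are not modelled: the sketch assumes their outputs are already available, and the function names, scores and pairwise winners are hypothetical toy values.

```python
from collections import Counter
from itertools import combinations


def one_against_rest(scores):
    """scores[l] is the real-valued output of the 'class l vs. rest' classifier;
    the class with the maximal output is the estimate."""
    return max(range(len(scores)), key=lambda l: scores[l])


def one_against_one(pairwise_winner, n_classes):
    """pairwise_winner[(a, b)] is the class chosen by the classifier trained on
    classes a and b (a < b); the class with the most votes is the estimate."""
    votes = Counter(pairwise_winner[pair]
                    for pair in combinations(range(n_classes), 2))
    return votes.most_common(1)[0][0]


# Toy outputs for a 3-class problem (made-up values).
scores = [-0.3, 0.8, 0.1]
pairwise_winner = {(0, 1): 1, (0, 2): 2, (1, 2): 1}

rest_estimate = one_against_rest(scores)
one_estimate = one_against_one(pairwise_winner, 3)
n_pairs_for_10_classes = len(list(combinations(range(10), 2)))
```

For a 10-class problem such as digit recognition, one-against-rest needs 10 classifiers, while one-against-one needs 10 * 9 / 2 = 45, each trained on a smaller subset of the data. Ties in the pairwise vote are not handled in this sketch.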



(a) Binary tree classifier, root node {A,B,C,D,E,F} (b) General hierarchical classifier,
root node {A,B,C,D,E,F}

Fig. 2. Two examples of hierarchical classifiers. The graphs are directed acyclic graphs
with a single root node at the top of the graph and with terminal nodes (leaves) at the
bottom. Individual classes are represented in the leaves, and the other nodes within
the graph are classifiers performing a binary decision task, which is defined through
the annotations of the incoming and the outgoing edges.



Weston and Watkins proposed a natural extension of the binary SVM approach to solve the
N-class classification problem directly. Here re-labeling of the training data is not
necessary. All the N classes are considered at once, and the separating conditions are
integrated into a single optimisation problem. As for the one-against-rest classifiers,
the result is an N-class classifier with N weight vectors and N threshold values. The
recall phase is organized as for the one-against-rest classifier strategy.

The goal of this paper is to apply the decomposition schemes one-against-rest,
one-against-one, and tree-structured to the SVM classifier in a multi-class pattern
recognition problem. Whereas the one-against-rest and one-against-one classifiers are
clearly defined, the hierarchical classifier architecture needs further explanation.



3 SVM Classifier Hierarchies

One of the most important problems in multi-class pattern recognition is the existence
of confusion classes. A confusion class is a subset of the set of classes {1, ..., N} in
which the feature vectors are very similar, so that a small amount of noise in the
measured features may lead to misclassifications. For example, in OCR the measured
features for members of the classes o, O, 0 and Q are typically very similar, so usually
{o, O, 0, Q} defines a confusion class. The major idea of hierarchical classification is
first to make a coarse discrimination between confusion classes and then a finer
discrimination within the confusion classes.

In Figure 2 examples of hierarchical classifiers are depicted. Each node within the
graph represents a binary classifier discriminating feature vectors of a confusion class
into one of two smaller confusion classes or possibly into individual classes. The
terminal nodes of the graph (leaves) represent these individual classes, and the other
nodes are classifiers performing a binary decision task; thus these nodes have exactly
two children. Nodes within the graph may have more than one incoming edge. Figure 2a
shows a tree-structured classifier, where each node has exactly one incoming edge. In
Figure 2b a more general classifier structure defined through a special directed acyclic
graph is depicted. In the following we restrict our considerations to tree-structured
SVMs.

The classification subtask is defined through the annotations of the incoming
and outgoing edges of the node. Let us consider for example the SVM classifier at
the root of the tree in Figure 2b. The label of the incoming edge is {A, ..., F}, so
for this (sub-)tree a 6-class classification task is given. The edges to the children
are annotated with {A, C, D} (left child) and {B, E, F} (right child). This means
that this SVM has to classify feature vectors into confusion class {A, C, D} or
{B, E, F}. To achieve this, all members of the six classes {A, ..., F} have to be
re-labeled: feature vectors with class labels A, C, or D get the new label -1
and those with class label B, E, or F get the new label 1. After this re-labeling
procedure the SVM is trained as described in the previous section. Note that
re-labeling has to be done for each classifier training.
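
The re-labeling step for a single node can be illustrated with a short Python sketch. The function name and the list-based data layout are assumptions made for this illustration; the sketch only prepares the binary targets and does not train an SVM.

```python
def relabel_for_node(labels, left_classes, right_classes):
    """Return (indices, targets) for training the binary classifier at one tree node.

    Samples whose class lies in left_classes get target -1, samples whose class
    lies in right_classes get target +1; samples of all other classes are not
    used at this node.
    """
    left, right = set(left_classes), set(right_classes)
    if left & right:
        raise ValueError("the two child class sets must be disjoint")
    indices, targets = [], []
    for idx, label in enumerate(labels):
        if label in left:
            indices.append(idx)
            targets.append(-1)
        elif label in right:
            indices.append(idx)
            targets.append(1)
    return indices, targets


# Root node of the example: {A, C, D} versus {B, E, F}.
labels = ["A", "B", "C", "D", "E", "F", "A", "E"]
indices, targets = relabel_for_node(labels, {"A", "C", "D"}, {"B", "E", "F"})
print(targets)  # [-1, 1, -1, -1, 1, 1, -1, 1]
```

Because every node uses its own pair of class subsets, the function would be called once per node, each time on the samples that reach that node.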

We have not yet answered the question of how to construct this subset tree. One
approach to construct such a tree is to divide the set of classes $K$ into disjoint
subsets $K_1$ and $K_2$ utilizing clustering. In clustering and vector quantization a
set of representative prototypes $\{c_1, \dots, c_k\}$ in the feature space is determined by
unsupervised learning from the feature vectors $x^\mu$, $\mu = 1, \dots, M$, of the training set.
For each prototype $c_j$ the Voronoi cell $R_j$ and the cluster $C_j$ are defined by

$$R_j := \{x : \|c_j - x\|_2 = \min_i \|c_i - x\|_2\}$$

and

$$C_j := R_j \cap \{x^\mu : \mu = 1, \dots, M\}.$$

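The relative frequency of members of class $i$ in cluster $j$ is

$$p_{ij} := \frac{|\Omega_i \cap C_j|}{|C_j|},$$

where for class $i$ the set $\Omega_i$ is defined by

$$\Omega_i = \{x^\mu : \mu = 1, \dots, M, \ \text{class}(x^\mu) = i\}.$$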

The $k$-means clustering with $k = 2$ cluster centers $c_1$ and $c_2$ defines a hyperplane
in the feature space separating two sets of feature vectors. From the corresponding
clusters $C_1$ and $C_2$ a partition of the classes $K$ into two subsets $K_1$
and $K_2$ can be achieved through the following assignment:

$$K_j := \{i \in K : j = \arg\max \{p_{i1}, p_{i2}\}\}, \quad j = 1, 2.$$

Recursively applied, this procedure leads to a binary tree as depicted in Figure|3 
This assignment scheme can be extended to the case k > 2. 
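
The recursive construction can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the 2-means routine is a plain Lloyd iteration, the relative frequency is taken as the fraction of cluster j that belongs to class i, a degenerate clustering is resolved by forcing a split, and all function names are hypothetical.

```python
import random


def two_means(points, iters=20, seed=0):
    """Plain 2-means on a list of equal-length tuples; returns a cluster index (0 or 1) per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, 2)  # two distinct training points as initial prototypes
    assign = [0] * len(points)
    for _ in range(iters):
        # Voronoi step: each point goes to its nearest prototype (squared Euclidean distance).
        assign = [
            min((0, 1), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Update step: move each prototype to the mean of its cluster.
        for j in (0, 1):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assign


def split_classes(labels, assign):
    """Put class i into subset j, where j maximises p_ij = |class i in cluster j| / |cluster j|."""
    sizes = [assign.count(0), assign.count(1)]
    subsets = (set(), set())
    for cls in set(labels):
        counts = [0, 0]
        for lab, a in zip(labels, assign):
            if lab == cls:
                counts[a] += 1
        p = [counts[j] / sizes[j] if sizes[j] else 0.0 for j in (0, 1)]
        subsets[0 if p[0] >= p[1] else 1].add(cls)
    return subsets


def build_tree(points, labels):
    """Return a nested tuple: leaves are class labels, inner nodes are (left, right) pairs."""
    classes = sorted(set(labels))
    if len(classes) == 1:
        return classes[0]
    left, right = split_classes(labels, two_means(points))
    if not left or not right:  # degenerate clustering: force a split (assumption of this sketch)
        left, right = set(classes[:1]), set(classes[1:])

    def subset(keep):
        idx = [i for i, lab in enumerate(labels) if lab in keep]
        return [points[i] for i in idx], [labels[i] for i in idx]

    return build_tree(*subset(left)), build_tree(*subset(right))


# Toy data: four classes on a line, two of them near 0 and two near 10.
points = [(base + off,) for base in (0.0, 1.0, 10.0, 11.0) for off in (-0.1, 0.0, 0.1)]
labels = [c for c in range(4) for _ in range(3)]
tree = build_tree(points, labels)
print(tree)
```

On this toy data the first split separates classes {0, 1} from {2, 3}; which subset ends up as the left or the right child depends on the random initialisation. Each inner node of the resulting tuple corresponds to one binary SVM to be trained on re-labeled data.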



4 Numerical Experiment 

The data set used for evaluating the performance of the classifiers consists
of 20,000 handwritten digits (2,000 samples per class). The digits, normalized
in height and width, are represented by a 16 x 16 matrix $(p_{ij})$ where
$p_{ij} \in \{0, \dots, 255\}$ is a value from an 8-bit gray scale (for details concerning
the data set see ^). 




Fig. 3. 60 examples of the handwritten digits data set.



The whole data set has been divided into a set of 10,000 training samples 
and a set of 10,000 test samples. The training set has been used to design the 
classifiers, and the test set for testing the performance of the classifiers. 

For this data set we give results for the following classifiers and training pro- 
cedures: 






Tree-Structured Support Vector Machines 415 

MLP: Multilayer perceptrons with a single hidden layer of sigmoidal units
(Fermi transfer function) trained by standard backpropagation; 100 training
epochs; 200 hidden units.

1NN: 1-nearest neighbour classifier.

LVQ: 1-nearest neighbour classifier trained with Kohonen's software package
with the OLVQ1 and LVQ3 training procedures, each with 50 training epochs;
500 prototypes are used.

RBF: RBF networks with a single hidden layer of Gaussian RBFs trained
through three-phase learning P (first phase: calculating the RBF centers
and scaling parameters through Kohonen's learning vector quantization; second
phase: learning the output weights by supervised gradient descent optimization
of the mean square error function; third phase: backpropagation-like learning of
the whole RBF architecture (centers, scaling parameters, and output weights)
with 100 training epochs). 200 units, each with a single scaling parameter, are
used in the hidden layer.

SVM-1-R: SVM with the Gaussian kernel function; one-against-rest strategy;
the NAG library has been used for optimization.

SVM-1-1: As SVM-1-R but with the one-against-one decomposition scheme.

SVM-TR: Binary tree of SVM networks. The classifier tree has been built by
k-means clustering with k = 2. In Figure 4 a representative tree is depicted which
was found by clustering experiments. The decomposition into subclassification
problems which is given in Figure 2 is then used for the training of the individual
SVM classifiers.




Fig. 4. Tree of subclasses calculated through a 2-means clustering procedure for the 
handwritten digits data set. 



Table 1. Results for the handwritten digits. Here the medians of three different
classification experiments are given.

Classifier   MLP    1NN    LVQ    RBF    SVM-1-R   SVM-1-1   SVM-TR
error [%]    2.41   2.32   3.01   1.51   1.40      1.37      1.39

For this data set further results may be found in the final StatLog report (see
p. 135-138 in P) and in p. In the StatLog report the error rate of the best
classifiers (quadratic polynomials, k-nearest neighbours, and multilayer perceptrons)
is approximately 2% using the first 40 principal components of this data. The
error rate for the RBF networks reported there is 5.5%, which is significantly higher
than the 1.5% error rate found here, because in the numerical experiments
given in Pj two-phase RBF learning has been used. In [Z| it has been shown that
the classification performance of RBF networks can be significantly improved
through a three-phase learning scheme: calculating the centers and scaling
parameters of the RBF kernels in a first learning step through a clustering or vector
quantization algorithm, training the output layer weights separately in a
second supervised learning phase, and finally learning the whole architecture
(centers, scaling parameters, and output weights) in a third backpropagation-like
training phase. The error rates for the 1NN, LVQ, and MLP classifiers are
similar to the results stated in the StatLog report. The 1NN and LVQ classifiers
perform well; RBF networks trained through the three-phase learning scheme
and support vector learning show the best classification performance. Each RBF
classifier architecture trained through three-phase learning or support vector
learning has been tested three times. In Table 1 the medians of these three
experiments are given. The error rates of the three different SVM architectures
are very close together (all results are in the range of 1.35-1.46%), and therefore
a significant difference between the decomposition strategies one-against-rest,
one-against-one, and the binary SVM classifier tree could not be found.

5 Conclusion 

We have presented different strategies for the N-class classification problem
utilising the SVM classifier approach. In detail we have discussed a novel
tree-structured SVM classifier architecture. For the design of binary classifier trees we
introduced unsupervised clustering or vector quantisation methods. We have
presented numerical experiments on a benchmark data set of handwritten digits.
Here, the proposed SVM tree classifier scheme shows good classification
results (1.39% error), in the range of the one-against-rest (1.40%) and
one-against-one (1.37%) classifier architectures. For further evaluation of this
tree-structured SVM classifier, numerical experiments for different multi-class
problems have to be made, particularly on data sets with many different classes.

Abstract. Computer-based face perception is becoming increasingly
important for many applications like biometric face recognition, video
coding or multi-modal human-machine interaction. Fast and robust detection
and segmentation of a face in an unconstrained visual scene is a
basic requirement for all kinds of face perception. This paper deals with
the integration of three simple visual cues for the task of face detection
in grey level images. It is achieved by a combination of edge orientation
matching, Hough transform and an appearance based detection method.
The proposed system is computationally efficient and has proved to be
robust under a wide range of acquisition conditions like varying lighting,
pixel noise and other image distortions. The detection capabilities
of the presented algorithm are evaluated on a large database of 13122
images including the frontal-face set of the m2vts database. We achieve
a detection rate of over 91% on this database with only few false
detects.



1 Introduction 



Robust and fast face detection for real-world applications such as video coding, 
multi-modal human machine interaction or biometric face recognition is still an 
open research field. For many such applications the detection should be possible 
at more than 10 frames/second which would allow online tracking of the detected 
faces. Besides that detection should be robust under a wide range of acquisition 
conditions like variations of lighting and background. 

In the past many approaches to the problem of face detection have been 
made. Most of the fast algorithms use color information for the segmentation 
of skin-tone-like areas. These areas are usually clustered and searched for facial 
features. See [ 1 1213141,1] for reference. Another widely-used class of methods for 
finding faces uses various kinds of grey level correlation approaches. The majority 
of the approaches [bl /IjSil) use a separate class for each of the faces and non-faces 
to model the problem domain. 

In this paper we investigate a method that combines three simple matching
algorithms. Due to the simple structure of each single method the processing
runs in real time (12 fps for a 320 x 240 video stream). We also present a fusion
method based on statistical normalization. The detection capabilities of the
presented algorithm are evaluated on a large database of 13122 images, each of them
containing one or more faces. We achieve a detection rate of over 91% on this
database with only few false detects.

2 Edge Orientation Based Methods 

2.1 Edge Orientation Fields 

The extraction of edge information (strength and orientation) from a two-dimensional
array of pixels $I(x, y)$ (a grey-scale image) is the basic feature calculation
in our detection framework. In this work we use the Sobel method (see for
example COI) for edge processing. It is a gradient-based method which
convolves the image $I(x, y)$ with two 3 x 3 filter masks, $K_x$ for horizontal filtering
and $K_y$ for vertical filtering. The convolution of the image with the two filter
masks gives two edge strength images $G_x(x, y)$ and $G_y(x, y)$,

$$G_x(x, y) = K_x * I(x, y), \qquad (1)$$

$$G_y(x, y) = K_y * I(x, y). \qquad (2)$$



The absolute value $S(x, y)$, referred to as edge strength, and the edge direction
information $\Phi(x, y)$ are obtained using:

$$S(x, y) = \sqrt{G_x^2(x, y) + G_y^2(x, y)}, \qquad (3)$$

$$\Phi(x, y) = \arctan\left(\frac{G_y(x, y)}{G_x(x, y)}\right) + \frac{\pi}{2}. \qquad (4)$$

The edge information on homogeneous parts of the image where no grey value
changes occur is often noisy and bears no useful information for the detection.
To exclude this information we apply a threshold $T_s$ to the edge strength $S(x, y)$,
generating an edge strength field $S_T(x, y)$,

$$S_T(x, y) = \begin{cases} S(x, y) & \text{if } S(x, y) > T_s \\ 0 & \text{else.} \end{cases} \qquad (5)$$

The edge direction as stated in equation (4) takes on values from 0 to $2\pi$. The
direction of an edge depends on whether the grey value changes from dark to
bright or vice versa. This information is irrelevant for our purposes. Therefore,
we map the direction information to the range of values $[0 \dots \pi]$, obtaining a new
field

$$\hat{\Phi}(x, y) = \begin{cases} \Phi(x, y) & \text{if } 0 \le \Phi(x, y) < \pi \\ \Phi(x, y) - \pi & \text{if } \pi \le \Phi(x, y) < 2\pi \end{cases} \qquad (6)$$
which we call the edge orientation field. The edge orientation information can
be rewritten using a complex formula

$$V(x, y) = S_T(x, y)\, e^{j \hat{\Phi}(x, y)}, \qquad (7)$$

where $V(x, y)$ is the complex edge orientation vector field and $j^2 = -1$. $S_T(x, y)$
and $\hat{\Phi}(x, y)$ are obtained using equations (5) and (6). The edge orientation vector
field can be displayed as shown in Figure 1. The elements of $V$ are referred to
as vectors $v$.

Fig. 1. Example of an image (a), the edge directions (b), edge strength (c), and the
thresholded edge orientation field (d), which is used for further processing.
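
The computation of the edge orientation vector field can be sketched in pure Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the standard Sobel masks are applied as a correlation on a list-of-lists image, the direction is taken from `atan2` (the sign change introduced by a true convolution vanishes after folding to half a turn), border pixels are left at zero, and the function names are hypothetical.

```python
import cmath
import math

KX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]    # Sobel mask for horizontal grey value changes
KY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]    # Sobel mask for vertical grey value changes


def filter3x3(img, mask, x, y):
    """Apply a 3x3 mask centred on pixel (x, y); img is indexed as img[y][x]."""
    return sum(mask[j][i] * img[y + j - 1][x + i - 1] for j in range(3) for i in range(3))


def edge_orientation_field(img, threshold):
    """Return a complex field: magnitude = thresholded strength, angle folded to [0, pi)."""
    h, w = len(img), len(img[0])
    field = [[0j] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = filter3x3(img, KX, x, y)
            gy = filter3x3(img, KY, x, y)
            strength = math.hypot(gx, gy)                 # edge strength, as in (3)
            if strength <= threshold:                     # thresholding, as in (5)
                continue
            phi = math.atan2(gy, gx) % (2 * math.pi)      # direction in [0, 2*pi)
            phi_hat = phi % math.pi                       # folding, as in (6)
            field[y][x] = cmath.rect(strength, phi_hat)   # complex vector, as in (7)
    return field


step = [[0, 0, 255, 255, 255] for _ in range(5)]        # dark-to-bright vertical edge
flipped = [[255 - p for p in row] for row in step]      # bright-to-dark version
v1 = edge_orientation_field(step, threshold=50)
v2 = edge_orientation_field(flipped, threshold=50)
print(abs(v1[2][2]), abs(v2[2][2]))  # 1020.0 1020.0
```

Both test images yield the same vector at the edge, which illustrates that the folded orientation no longer depends on the polarity of the grey value change.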



2.2 Edge Orientation Matching 

We introduced EOM (edge orientation matching) in an earlier work and
review the main steps here briefly. To build a face model for EOM we use
a sample of hand-labeled face images. The faces are cropped, aligned and scaled
to the size 32 x 40 in the grey level domain. From this set of normalized face
images an average face is computed. We also add to the average face a vertically
mirrored version of each face in the set. Finally the edge orientation vector
field $V_M(x, y)$ is calculated from the average face. This is used as a model for
the detection process. For face detection the model $V_M(x, y)$ is shifted over
the image, and at each image position $(x, y)$ the similarity between the model
and the underlying image patch is calculated. The image is represented by its
orientation field $V_I(x, y)$. In order to determine the similarity between model
and image patch one can think of several distance metrics which either rely on
the direction information only or use both edge strength and direction for score
calculation. In general the orientation matching process can be described as a
convolution-like operation

$$C_O(x, y) = \sum_n \sum_m dist(V_M(m, n), V_I(x + m, y + n)), \qquad (8)$$

where $C_O(x, y)$ is an image-like structure containing, for each possible model
position within the image, the similarity score between a sub-image of size
m x n and the model, which is of the same size. The function dist() calculates
the local distance between two single orientation vectors. In the present system
the function dist() is always designed to give a low value for a high similarity
and a high value for poor similarity.

The local distance function dist() is defined as a mapping of two 2-dimensional
vectors $v_i$ and $v_m$ to $[0 \dots s_{max}]$. In our case they stem from an
edge orientation field of the image $V_I$ and of the model $V_M$. They have the
property $\arg\{v\} \in [0 \dots \pi]$. The upper bound of the distance $s_{max}$ occurs when
these vectors are perpendicular and both of maximal length. The value of $s_{max}$
depends on the normalization of the vectors $v_i$, $v_m$. As we use 8-bit grey
level coded images we normalize the vectors so that $s_{max} = 255$.

If one only wants to regard the edge direction information, the local distance
can be written as follows:

$$dist = \begin{cases} \sin(|\arg\{v_i\} - \arg\{v_m\}|) \cdot s_{max} & \text{if } |v_i|, |v_m| > 0 \\ s_{max} & \text{else} \end{cases} \qquad (9)$$

This means that only directional information is used, no matter how reliable
it is, because noisy edges usually fall below the threshold and are set to zero.
In [11 ij we introduce more ways to compute the function dist{). There we also 
propose an elastic matching method. 
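
The direction-only distance of equation (9) and the summation of equation (8) can be sketched as follows. This is a minimal sketch, assuming that orientation vectors are stored as Python complex numbers in list-of-lists fields indexed as `field[row][column]`; the function names are hypothetical and the elastic matching variant is not covered.

```python
import cmath
import math

S_MAX = 255.0  # upper bound of the local distance for 8-bit normalised vectors


def dist(v_i, v_m, s_max=S_MAX):
    """Direction-only local distance between an image vector and a model vector."""
    if abs(v_i) == 0 or abs(v_m) == 0:
        return s_max                       # no edge information: worst score
    return math.sin(abs(cmath.phase(v_i) - cmath.phase(v_m))) * s_max


def match_score(model, image, x, y):
    """Sum of local distances with the model's top-left corner placed at image position (x, y)."""
    return sum(
        dist(image[y + n][x + m], model[n][m])
        for n in range(len(model))
        for m in range(len(model[0]))
    )


model = [[1 + 0j, 1j], [1 + 1j, 2 + 0j]]
print(match_score(model, model, 0, 0))   # identical fields give the best (lowest) score
print(dist(1 + 0j, 1j))                  # perpendicular vectors give (almost exactly) s_max
```

Evaluating `match_score` at every admissible position (x, y) would yield the similarity map; low values mark good face candidates.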



2.3 Generalized Hough Transform for Face Shape Detection 

Fig. 2. Geometric construction of the templates. (a) The two possible positions of the
center point of the ellipses depend on the direction of the vector v. (b) Representation
of an ideal weighted template T for a, $\varphi$ = 33°.

A technique based on the generalized Hough transform, published in H2|, is used
for the detection of the elliptical outline of the frontal face shape. This method is
capable of detecting the approximate position of all ellipses within a certain
range of variation in scale and rotation. The main idea is to perform a generalized
Hough transform using an elliptical annulus as template. The directional
information allows the transformation to be implemented very efficiently,
since the template used to update the accumulator array can be reduced to only
two sectors of the elliptical annulus. The technique can be described as follows.
We assume that $a$ and $b$ are the lengths of the semi-axes of an ellipse used
as a reference, and that $\rho_r$ and $\rho_e$ are, respectively, the reduction and expansion
coefficients defining the scale range: $a_{min} = \rho_r \cdot a$, $b_{min} = \rho_r \cdot b$, $a_{max} = \rho_e \cdot a$ and
$b_{max} = \rho_e \cdot b$. We also assume that $V$ is the edge orientation field obtained
in Section 2.1 and that $C_H$ is the accumulator array or similarity map. The
template $T$ is a function of the angle $\varphi$ at the point $(x_0, y_0)$. The points $(x_1, y_1)$
and $(x_2, y_2)$ in Figure 2 are the only two points where an ellipse tangent to $v$ in
$(x_0, y_0)$ with semi-axes $a$, $b$ could be centered. Finally, if an angular variation $\theta$ is
introduced in the directional information, the geometric locus $T$ of the possible
centers becomes:



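$$T = \left\{ (x, y) : \rho_r^2 \le \left(\frac{x - x_0}{a}\right)^2 + \left(\frac{y - y_0}{b}\right)^2 \le \rho_e^2, \ \Delta_{angle}\left(\arctan\frac{y - y_0}{x - x_0}, \arctan\frac{y_1 - y_0}{x_1 - x_0}\right) \le \frac{\theta}{2} \right\} \qquad (10)$$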



where $\Delta_{angle}(\alpha, \beta)$ returns the smaller angle determined by $\alpha$, $\beta$.

Figure 3 shows an example of an image and the accumulator array $C_H$ generated
by the Hough transform.
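
The geometric construction of the template can be sketched in Python. The sketch is derived here from the parametric form of an axis-aligned ellipse and is not taken from the cited method; it assumes that the angle denotes the tangent direction of the edge at the point, and all function names are hypothetical.

```python
import math


def ellipse_centers(x0, y0, phi, a, b):
    """Centers of the two axis-aligned ellipses (semi-axes a, b) that pass through
    (x0, y0) with tangent direction phi (radians)."""
    # A point on the ellipse is (xc + a*cos t, yc + b*sin t) with tangent
    # (-a*sin t, b*cos t); the tangent must be parallel to (cos phi, sin phi).
    t = math.atan2(-b * math.cos(phi), a * math.sin(phi))
    c1 = (x0 - a * math.cos(t), y0 - b * math.sin(t))
    c2 = (x0 + a * math.cos(t), y0 + b * math.sin(t))
    return c1, c2


def delta_angle(alpha, beta):
    """Smaller angle between two undirected lines with directions alpha and beta."""
    d = abs(alpha - beta) % math.pi
    return min(d, math.pi - d)


def in_template(x, y, x0, y0, phi, a, b, rho_r, rho_e, theta):
    """True if (x, y) lies in the two annulus sectors around the candidate centers."""
    r2 = ((x - x0) / a) ** 2 + ((y - y0) / b) ** 2
    if not (rho_r ** 2 <= r2 <= rho_e ** 2):
        return False                       # outside the elliptical annulus
    (x1, y1), _ = ellipse_centers(x0, y0, phi, a, b)
    to_point = math.atan2(y - y0, x - x0)
    to_center = math.atan2(y1 - y0, x1 - x0)
    return delta_angle(to_point, to_center) <= theta / 2


c1, c2 = ellipse_centers(0.0, 0.0, math.pi / 2, 3.0, 2.0)   # vertical tangent at the origin
print(c1, c2)
```

For a vertical tangent the two candidate centers lie to the left and right of the edge point at distance a. In a voting scheme every cell of the accumulator array for which `in_template` holds would be updated for each edge vector.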

3 Appearance Based Matching

In the third branch we use an appearance based matching method for face detec- 
tion first published by Yang et al. |0| • Compared to similar methods like neural 
network based, histogram based or SVM based methods, it is computationally 
efficient. This algorithm models the problem domain as a two-class problem
consisting of a face class and a non-face class. Each class is trained from small 20 x 20
grey value image patches. Each image patch $I_s(u, v)$ is histogram equalized
before training or classification. After this it is transformed into a feature vector
using

$$F_s = 256 \cdot (u \cdot 20 + v) + I_s(u, v). \qquad (11)$$

Thereby we assume that the values of $I_s(u, v)$ in the grey level domain are
restricted to $0 \le I_s(u, v) < 256$. The classification method is the so-called SNoW
(Sparse Network of Winnows) method. It is a network of linear units which is
closely related to the perceptron [El- The similarity map produced by this method
is called $C_C$.
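
Equation (11) assigns one active feature index to every pixel of a patch. A minimal Python sketch of this mapping follows; the function name and the list-of-lists patch layout `patch[u][v]` are assumptions, and histogram equalization is taken to have been applied beforehand.

```python
def snow_feature_indices(patch, size=20, levels=256):
    """Map a size x size grey value patch to its list of active feature indices."""
    if len(patch) != size or any(len(row) != size for row in patch):
        raise ValueError("patch must be size x size")
    return [
        levels * (u * size + v) + patch[u][v]   # one index per pixel position and grey value
        for u in range(size)
        for v in range(size)
    ]


patch = [[(u + v) % 256 for v in range(20)] for u in range(20)]
indices = snow_feature_indices(patch)
print(len(indices), min(indices), max(indices))
```

Each of the 400 pixels activates exactly one of 20 x 20 x 256 = 102400 possible features, so the representation is very sparse, which suits a network of linear units.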

For this system we do not adopt the expensive bootstrapping training like 
proposed in jOj and |E|. Instead a set of 10000 manually cropped face images and 
50000 randomly generated non-face images are used for training. This results in 
considerably higher number of false positives as for example reported in |0j, but 
we can show that the number of false positives of each individual method does 
not play a key role in our fusion framework. 



4 Fusion Method 

As the similarity maps are produced by different matching algorithms they cannot
be compared directly. Figure 3 shows an example of such a similarity
map of the Hough transform (called accumulator array there). Therefore we
want to estimate a statistical description of each matching algorithm that gives
the probability of a successful detection at a certain image location. These
probabilities can be combined using simple fusion strategies as shown in the
following.

Fig. 3. Example of an image and the corresponding accumulator array $C_H$ after
processing the Hough transform. Dark areas are good candidates for ellipse centers.

4.1 Similarity Map Normalization 

We assume that $C$ is a similarity map from the set $\{C_C, C_O, C_H\}$. In addition
we assume that $c$ is an element of this map at position $(x, y)$, $c = C(x, y)$, and
$c < T_{m_i}$, where $T_{m_i}$ is a predefined matching threshold for each cue. We can
now define a binary event $\omega$, which describes whether $c$ belongs to a true face
position ($\omega_0$) or a false detect ($\omega_1$). We are especially interested in the two
probability density functions $p(c \mid \omega_0)$, which describes the probability for a similarity
value $c$ being associated with the correct face detection, and $p(c \mid \omega_1)$ for the opposite
event.

The PDF (probability density function) for $c$ can be written as

$$p(c) = p(c \mid \omega_0)\, p(\omega_0) + p(c \mid \omega_1)\, p(\omega_1), \qquad (12)$$

where the a priori probability $p(\omega_0)$ is the recognition rate or ground truth
of the classifier and $p(\omega_1) = 1 - p(\omega_0)$ is the misclassification rate. These
probabilities are estimated from a training data set. We use the detection results from
the single branches as an estimate for these probabilities.

The PDF is calculated from histograms which are obtained from a training
data set. Figure 4 shows an example of such a histogram generated from the
typical scores of the EOM branch. The probability of assigning the right class
label to a measurement is, according to Bayes' theorem:

$$p(\omega_0 \mid c) = \frac{p(c \mid \omega_0)}{p(c)}\, p(\omega_0). \qquad (13)$$

The PDFs $p(c \mid \omega_0)$ and $p(c \mid \omega_1)$ are estimated on the training database.
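
The normalization of equations (12) and (13) can be sketched as a per-bin lookup table built from two score histograms. This is a minimal sketch: the function name is hypothetical, the histograms are assumed to be raw bin counts over the same score bins, and the returned value for an empty bin (0.0) is a choice made for this illustration.

```python
def posterior_table(hist_face, hist_nonface, prior_face):
    """Per-bin estimate of p(w0 | c) from histograms of correct and false matches."""
    n0, n1 = sum(hist_face), sum(hist_nonface)
    table = []
    for h0, h1 in zip(hist_face, hist_nonface):
        p_c_w0 = h0 / n0                                           # p(c | w0)
        p_c_w1 = h1 / n1                                           # p(c | w1)
        p_c = p_c_w0 * prior_face + p_c_w1 * (1.0 - prior_face)    # equation (12)
        table.append(p_c_w0 * prior_face / p_c if p_c > 0 else 0.0)  # equation (13)
    return table


# Three score bins; the counts are made-up illustration values.
table = posterior_table([8, 2, 0], [1, 4, 5], prior_face=0.5)
print([round(p, 3) for p in table])  # [0.889, 0.333, 0.0]
```

After this step the scores of all three matchers are expressed on the same probability scale, so they can be combined by simple fusion rules.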




Fig. 4. Example for a measured histogram of the EOM obtained from 6000 test images.
The first histogram shows the distribution for correctly matched face locations while 
the second was obtained from false matches. 

4.2 Decision Fusion

The final step in our fusion approach is the combination of the results obtained
from the normalized template matchers. To do so, simple fusion approaches
are used. We evaluate the sum rule (SUM), which is defined as

$$C = \sum_i p(\omega_0 \mid c_i), \qquad (14)$$

and the product rule (PROD)

$$C = \prod_i p(\omega_0 \mid c_i), \qquad (15)$$

where i addresses the three matching methods. To obtain a final decision we
combine the similarity maps into a final similarity map, as illustrated
in Figure 5. The similarity maps from the EOM and the Hough transform are fully
computed and fused for each pixel, resulting in a new similarity map $C_{OH}$. The
values of $C_C$ are only computed and fused if $C_{OH}$ is above a certain threshold.
This is done for efficiency reasons, because the appearance based algorithm is
computationally rather demanding compared to the EOM and the Hough transform.
In order to find a face position the combined similarity map $C(x, y)$
is searched for all values that are above a predefined matching threshold $T_f$.
Appropriate values for $T_f$ can only be found heuristically.
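
The two fusion rules and the two-stage evaluation order can be sketched as follows. This is a minimal sketch for a single image position; the function names, the threshold parameter `t_oh`, and passing the appearance based matcher as a callable are assumptions made for this illustration.

```python
def fuse(posteriors, rule="sum"):
    """Combine normalised scores p(w0 | c_i) with the sum or the product rule."""
    if rule == "sum":
        return sum(posteriors)
    if rule == "prod":
        result = 1.0
        for p in posteriors:
            result *= p
        return result
    raise ValueError("rule must be 'sum' or 'prod'")


def cascade_fusion(p_eom, p_hough, appearance_fn, t_oh, rule="prod"):
    """Fuse EOM and Hough first; evaluate the costly appearance matcher only if needed."""
    c_oh = fuse([p_eom, p_hough], rule)
    if c_oh <= t_oh:
        return None                      # rejected early, appearance matcher not evaluated
    return fuse([p_eom, p_hough, appearance_fn()], rule)


print(cascade_fusion(0.1, 0.1, lambda: 0.9, t_oh=0.5))   # None
print(cascade_fusion(0.9, 0.8, lambda: 0.9, t_oh=0.5))   # product of the three scores
```

The early rejection is what keeps the expensive third branch off most image positions; the final value would then be compared against the heuristic matching threshold.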

5 Experiments and Results 

5.1 Database 

We used a database of 13122 images, mostly with a resolution of 384 x 288 pixels
encoded in 256 grey levels, each of them showing one or more persons, for the
test of our system. The database also includes 2352 images from the frontal face
set of the m2vts database m. All persons are displayed in frontal views with
considerable variations in size and illumination. Figure 6 shows some examples.

Fig. 5. Fusion principle of the three proposed matching methods.

Fig. 6. Example images from the database used for detection tests.



5.2 Results 

The detection capabilities of the proposed algorithm are shown in Table 1. We
regard a face as detected if the found face position and size do not deviate from
the true values by more than a predefined small tolerance. The face position
and size are defined by the position of the two eyes. In this work we count a face
as detected if the eye position and eye distance do not deviate from the true
values by more than 30% in terms of the true eye distance.

All the results reported in Table 1 are obtained using a resolution pyramid
with 8 resolution levels. The size ratio of two adjacent pyramid levels is $R_p = 1.25$
and the size of the biggest level is 384 x 288. The processing is carried out
in less than 100 msec on a Pentium II 500 MHz, using an efficient coarse-to-fine
search method described in CD- The size of the EOM template was 32 x 40 and
it was created from a training sample of 200 hand-labeled and normalized face
images. The ellipse size for the Hough transform was slightly bigger than 26 x 34.
The classifier of the appearance based method was trained from a set of 10000
face images and 100000 non-face images of size 20 x 20 pixels. As one can see from
Table 1, the number of false detects is reduced considerably by combining the
three classifiers: from more than 18000 for the best single method to fewer than
600. The detection rate of the fused system is about 91%, close to that of the
weakest single algorithm (91.5%).

Table 1. Detection results on our test dataset. The number of false detects can be
lowered significantly by the fusion compared to the single methods.

Method          Detection   #False Detects
EOM             97.7%       18341
Hough           96.0%       92866
Appearance      91.5%       141489
PROD - fusion   91.0%       540
SUM - fusion    90.8%       567

6 Related Work 

There are several face processing algorithms that use edge orientation information
or some kind of fusion for face processing. Bichsel PI uses dot products of
orientation maps for purposes of face recognition. Burl et al. HH utilize shape
statistics of the facial feature (eyes, nose, mouth) configuration learned from a
training sample. The feature candidates are detected by Orientation Template
Correlation. Another algorithm, which was partly used in this work, is proposed
by Maio and Maltoni. They use the Hough transform (see Sect. 2.3) with
a subsequent matching of manually generated orientation models for verification.
Their results are obtained on a small database of 70 images and they do
not use a full multi-resolution analysis and therefore can only detect faces of a
certain size. Recently Feraud et al. HHI published their work on the combination
of appearance based face detection with motion and color analysis for speed-up.
The system is reported to have a good detection performance but it is still not
real-time.

7 Conclusions and Future Work 

We have shown that a fusion of simple template matchers is a powerful method
for unconstrained face detection in natural scenes. In particular, the number of false
detects is considerably lower compared to each single method. This novel
combination for face detection yields a very good recognition rate of more than 91%
on a large database of 13122 images and can be carried out in real time on a
standard PC. We plan to incorporate more sophisticated matching algorithms
for further improvement of the detection capabilities. We also plan to test the
fusion algorithm for the task of human body detection.

Abstract. The veto effect caused by contradicting experts outputting
zero probability estimates leads to fusion strategies performing suboptimally.
This can be resolved using Moderation. The Moderation formula
is derived for the k-NN classifier using a Bayesian prior. The merits of
moderation are examined on real data sets.



1 Introduction 

Recently, the use of classifier fusion to improve accuracy of classification has 
become increasingly popular |4lbl6l/llUli2li3li4libll6| . Although many diverse 
and sophisticated strategies have been developed, there is still a considerable 
interest in simple fusion methods that do not require any training. Such methods 
can either perform at the decision level or operate directly on the soft decision 
probability outputs of the respective experts, as in Sum and Product. 

In the following we shall focus on the decision probability level fusion in gen- 
eral and on the product rule in particular. The product rule, which combines 
the multiple expert outputs by multiplication plays a prominent role because of 
its theoretically sound basis in probability calculus |9I8| . It is the proper fusion 
strategy when combining the outputs of experts utilising distinct (statistically in- 
dependent ) signal representations. It is also the optimal operator for combining 
the outputs of experts responding to an identical stimulus, under the assumption 
that the experts have been designed using statistically independent training sets. 
In spite of its theoretical underpinning, in many experimental studies Product 
has been shown to be outperformed by the less rigorously founded sum rule ■ 
The inferior performance is attributed to the veto effect. If estimation errors 
drive one of the class aposteriori probability estimates to zero, the output of the 
product fusion will also be zero, even if other experts provide a lot of support for 
the class. This severity of the product fusion strategy has motivated our previous 
research which led to the development of a heuristic modification of the classifier 
outputs before the product fusion is carried out 12]. The advocated MProduct 
which stands for Modified Product was demonstrated to be superior not only to 
Product but also to the sum rule. 

In this paper we argue that estimation errors are often caused by small sample
problems. We show that by taking small sample effects into account we can
develop a formula for correcting the outputs of individual experts, provided the
sampling distribution is known and can be incorporated as a Bayes prior. We
develop such a correction formula for the k-Nearest Neighbour (k-NN) decision
rule. Incidentally, this rule is affected by small sample problems even when the
size of the training set is large, as each decision is made by drawing a small
number of samples from the training set. We then validate our correction formula
experimentally on synthetic and real data sets. We demonstrate that Product
using moderated outputs of multiple k-NN classifiers strongly outperforms
the product fusion of raw classifier outputs. Finally, we compare the proposed
scheme with the heuristic MProduct and show that they are quite similar in
performance. The proposed scheme has the advantage that the modification formula
is very simple and adaptive to the number of nearest neighbours used by the decision
rule.

The paper is organised as follows. In the next section we shall introduce the
concept of classifier output moderation. In Section 3 we focus on the k-NN
decision rule and derive the formula for correcting the outputs of k-NN experts.
Experiments and experimental results are presented in Sections 4 and 5. The
paper is drawn to conclusion in Section 6.

2 Theory 

Given R experts and m classes, the product rule assigns an input pattern vector
$x$ to class $\omega_j$ if

$$\prod_{i=1}^{R} P_i(\omega_j \mid x) = \max_{k=1}^{m} \prod_{i=1}^{R} P_i(\omega_k \mid x) \qquad (1)$$

where $P_i(\omega_k \mid x)$ is an estimate of the class aposteriori probability $P(\omega_k \mid x)$
delivered by expert i. Note that the estimate will be influenced by a training set,
$X_i$, used for the design of expert i. Once the form of the classifier is chosen, the
training set is then deployed to estimate the underlying model parameters denoted
by vector $\gamma_i$. A particular value of the parameter vector obtained through
training will then define the expert output. This can be made explicit by denoting
the output by $P(\omega_k \mid x, X_i, \gamma_i)$. However, $\gamma_i$ is only an estimate of the true
model parameters. The estimate will be just a single realisation of the random
variable drawn from the sampling distribution $p(\gamma_i)$. If the sampling distribution
is known a priori, then the raw estimate

$$P_i(\omega_k \mid x) = P(\omega_k \mid x, X_i, \gamma_i) \qquad (2)$$

can be moderated by taking the prior into consideration. In other words, a
new estimate is obtained by integrating parameter dependent estimates over the
model parameter space as

$$\hat{P}_i(\omega_k \mid x) = \int P_i(\omega_k \mid x, X_i, \gamma_i)\, p(\gamma_i)\, d\gamma_i \qquad (3)$$

This is known as marginalization in Bayesian estimation.
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3 k-NN Classifier Output Moderation 

In Section El we argued for a moderation of raw expert outputs. The modera- 
tion is warranted for pragmatic reasons, namely to minimise the veto effect of 
overconfident erroneous classifiers. 

It is perhaps true to say that for training sets of reasonable size there should 
not be any appreciable difference between moderated and raw expert outputs. 
However, for some types of classifiers, moderation is pertinent even for sample 
sets of respectable size. An important case is the k-NN classifier. Even if the 
training set is relatively large, say hundreds of samples or more, the need for 
moderation is determined by the value of k, which may be as low as k = 1. 
Considering just the simplest case, a two class problem, it is perfectly possible 
to draw all k Nearest Neighbours from the same class, which means that one of 
the classes will have the expert output set to zero. In the subsequent (product) 
fusion this will then dominate the fused output and may impose a veto on the 
class even if other experts are supportive of that particular hypothesis. 
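The veto effect described above can be reproduced with a few lines of code. The following is a minimal sketch, not the authors' implementation: the function name `product_rule` and the three expert outputs are hypothetical and chosen only for illustration.

```python
from math import prod


def product_rule(expert_outputs):
    """Fuse expert outputs with the product rule of equation (1).

    expert_outputs[i][j] is expert i's estimate of P(class j | x).
    Returns the index of the winning class and the fused supports.
    """
    n_classes = len(expert_outputs[0])
    fused = [prod(expert[j] for expert in expert_outputs)
             for j in range(n_classes)]
    winner = max(range(n_classes), key=lambda j: fused[j])
    return winner, fused


# Hypothetical two-class example: two experts favour class 0, while a
# third (for instance a 1-NN expert) outputs a hard zero for class 0.
outputs = [[0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
winner, fused = product_rule(outputs)
# The single zero wipes out the support for class 0, so class 1 wins
# even though two of the three experts strongly prefer class 0.
```

A single zero is enough to veto a class regardless of how many experts support it, which is the motivation for moderating the k-NN outputs.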

We shall now consider this situation in more detail. Suppose that we draw k 
Nearest Neighbours and find that $\kappa$ of these belong to class $\omega$. Then 
the unbiased estimate $P_i(\omega|x)$ of the aposteriori probability 
$P(\omega|x)$ is given by 

$$P_i(\omega|x) = \frac{\kappa}{k} \tag{4}$$

It should be noted that the actual observation, $\kappa$ out of k, could arise 
for any value of $P(\omega|x)$ with the probability 

$$q(\kappa) = \binom{k}{\kappa} P^{\kappa}(\omega|x)\,[1 - P(\omega|x)]^{k-\kappa} \tag{5}$$

Assuming that a priori the probability $P(\omega|x)$ taking any value between 
zero and one is equally likely, we can find an aposteriori estimate of the 
aposteriori class probability $P(\omega|x)$ as 

$$\hat{P}_i(\omega|x) = \frac{\int_0^1 P(\omega|x)\, P^{\kappa}(\omega|x)\,[1 - P(\omega|x)]^{k-\kappa}\, dP(\omega|x)}{\int_0^1 P^{\kappa}(\omega|x)\,[1 - P(\omega|x)]^{k-\kappa}\, dP(\omega|x)} \tag{6}$$

where the denominator is a normalising factor ensuring that the total probability 
mass equals one. By expanding the term $[1 - P(\omega|x)]^{k-\kappa}$ and 
integrating, it can be easily verified that the right hand side of (6) becomes 

$$\hat{P}_i(\omega|x) = \frac{\kappa + 1}{k + 2} \tag{7}$$

which is the mean of the corresponding beta distribution. Thus the moderated 
equivalent of $\kappa/k$ is $(\kappa+1)/(k+2)$. Clearly our estimates of 
aposteriori class probabilities will never reach zero, which could cause a veto 
effect. For instance, for the Nearest Neighbour classifier with k = 1 the 
smallest expert output will be 1/3. As k increases the smallest estimate will 
approach zero as $1/(k+2)$ and will assume zero only when $k = \infty$. 
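The step from the ratio of integrals to the closed-form moderated estimate can be checked with the standard beta integral. In the sketch below $p$ is shorthand, introduced here, for $P(\omega|x)$, and $a$, $b$ are non-negative integers.

```latex
\int_0^1 p^{a}(1-p)^{b}\,dp = \frac{a!\,b!}{(a+b+1)!}

\hat{P}_i(\omega|x)
  = \frac{\int_0^1 p^{\kappa+1}(1-p)^{k-\kappa}\,dp}
         {\int_0^1 p^{\kappa}(1-p)^{k-\kappa}\,dp}
  = \frac{(\kappa+1)!\,(k-\kappa)!}{(k+2)!}
    \cdot \frac{(k+1)!}{\kappa!\,(k-\kappa)!}
  = \frac{\kappa+1}{k+2}
```

For k = 1 and $\kappa = 0$ this gives 1/3, the smallest output a moderated 1-NN expert can produce.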

For m class problems equation (7) can be extended to become 

$$\hat{P}_i(\omega|x) = \frac{\kappa + 1}{k + m} \tag{8}$$
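The moderation formulas are simple enough to check numerically. The sketch below assumes that the m-class extension takes the form $(\kappa+1)/(k+m)$, which is what a uniform prior over the m class probabilities yields; the function names are hypothetical and the midpoint-rule integration is only a sanity check of the two-class ratio of integrals.

```python
from fractions import Fraction


def raw_estimate(kappa, k):
    """Raw k-NN estimate: fraction of the k neighbours in the class."""
    return Fraction(kappa, k)


def moderated_estimate(kappa, k, m=2):
    """Moderated estimate (kappa + 1) / (k + m); m = 2 is the two-class case.

    The m-class form is an assumption of this sketch (uniform prior).
    """
    return Fraction(kappa + 1, k + m)


def moderated_by_integration(kappa, k, steps=20_000):
    """Two-class moderated estimate obtained by midpoint-rule integration
    of the ratio of integrals under a uniform prior on P(class | x)."""
    h = 1.0 / steps
    num = den = 0.0
    for s in range(steps):
        p = (s + 0.5) * h
        weight = p ** kappa * (1.0 - p) ** (k - kappa)
        num += p * weight
        den += weight
    return num / den


# A 1-NN expert: the raw outputs are 0 and 1, the moderated ones 1/3 and 2/3.
lowest = moderated_estimate(0, 1)
highest = moderated_estimate(1, 1)
```

With three classes and neighbour counts (3, 1, 1) out of k = 5, the moderated estimates are 4/8, 2/8 and 2/8, so they still sum to one.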

4 Experiments 

The aim of the experiments described in this section is to confirm the 
theoretically predicted benefits of k-NN classifier output moderation. As the 
moderation of aposteriori class probabilities eliminates the veto effect, we 
would expect the performance of the product rule on moderated outputs to surpass 
the success rate obtained with raw class aposteriori probability estimates. 

The multiple classifiers used in the experiments are designed using training 
sets obtained by bootstrapping. Bootstrapping is a re-sampling procedure rou- 
tinely used in bagging. Each bootstrap set is derived from the original training 
set by sampling with replacement. The size of the bootstrap set is normally 
equivalent to the cardinality of the original training set. When a training data 
set is small, the proportions of training patterns from the different classes may 
be unrepresentative. The probability of drawing a training set with samples from 
some class completely missing becomes non negligible. When this occurs, bagging 
may even become counterproductive. In our earlier work [3] we investigated the 
effect of different control mechanisms over sampling to minimise the imbalance 
in the representation of the class populations in the resulting bootstrap sets. 
Three modifications of the standard bagging method were considered. The same 
sampling strategies have been adopted in the following experiments to demon- 
strate the merits of moderation. We refer to the standard procedure as method 1 
and its modified versions as methods 2-4. The methods which exploit increasing 
amounts of prior knowledge can be summarised as follows. 

— Method 1. This is the standard bagging method outlined above. 

— Method 2. When bootstrap sets are created from the learning set we check 
the ratio of the number of samples per class in the bootstrap set. This ratio is 
compared to the ratio of samples per class in the learning set. If the difference 
between the compared ratios is larger than a certain class population bias 
tolerance threshold we reject the bootstrap set. We set the bias tolerance 
threshold to 10%. 

— Method 3. This method is similar to method 2 except that the bootstrap 
set ratio is compared to the ratio in the full set. By full set we mean the set 
containing all samples, learning and test samples. This full set ratio simulates 
a prior knowledge of the class distribution in the sample space. 

— Method 4. Here we only require that all classes be represented in the boot- 
strap set, without enforcing a certain ratio of samples per class. This is done 
by rejecting any bootstrap set that does not represent all classes. 
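The four sampling strategies can be sketched as rejection sampling around a standard bootstrap draw. This is an illustration under stated assumptions rather than the procedure used in the experiments: the text does not say how the "difference between the compared ratios" is measured, so the sketch assumes the absolute difference of per-class proportions, and the names `bootstrap_indices`, `reference_labels` and `max_tries` are hypothetical.

```python
import random
from collections import Counter


def class_proportions(labels, classes):
    """Fraction of samples belonging to each class in `classes`."""
    counts = Counter(labels)
    return {c: counts[c] / len(labels) for c in classes}


def bootstrap_indices(labels, method=1, reference_labels=None,
                      tolerance=0.10, rng=None, max_tries=10_000):
    """Draw one bootstrap set (indices into `labels`) with replacement.

    method 1: accept the first draw (standard bagging).
    method 2: reject draws whose class proportions differ from those of
              the learning set by more than `tolerance`.
    method 3: as method 2, but compare with `reference_labels`
              (the labels of the full set).
    method 4: reject draws in which some class is missing.
    """
    rng = rng or random.Random()
    n = len(labels)
    classes = sorted(set(labels))
    if method == 3 and reference_labels is not None:
        reference = class_proportions(reference_labels, classes)
    else:
        reference = class_proportions(labels, classes)
    for _ in range(max_tries):
        idx = [rng.randrange(n) for _ in range(n)]
        drawn = [labels[i] for i in idx]
        if method == 1:
            return idx
        if method == 4:
            if set(drawn) == set(classes):
                return idx
            continue
        proportions = class_proportions(drawn, classes)
        if all(abs(proportions[c] - reference[c]) <= tolerance
               for c in classes):
            return idx
    raise RuntimeError("no acceptable bootstrap set found")


# Hypothetical learning set: 15 samples of class 0 and 5 of class 1.
labels = [0] * 15 + [1] * 5
idx2 = bootstrap_indices(labels, method=2, rng=random.Random(0))
idx4 = bootstrap_indices(labels, method=4, rng=random.Random(0))
```

Rejection keeps each accepted set a genuine bootstrap sample while filtering out unrepresentative draws; the cost is a number of discarded draws that grows as the learning set shrinks.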



4.1 Experimental Methodology 

A single training set is randomly taken from the original sample space 
represented by the full data set. The k-NN classifier built using this original 
learning set is referred to as the single expert. The remaining samples are used 
as a test set. Using the learning set, 25 bootstrap sets are generated by 
bootstrapping. The decisions of the 25 experts built on these bootstrap sets are 
aggregated to classify the test set. These results are referred to as the bagged 
expert results. We compare these results to those obtained from the single 
expert, and to those obtained from other bagging methods. The above is repeated 
for three training set sizes. The sizes used were 20, 40, and 80 samples. We 
concentrate on this range of training set sizes for the following reasons. For 
the data sets used in our experiments the upper cardinality corresponds to a 
sufficiently large data set for which the k-NN classifier becomes stable. The 
lower end of the spectrum represents the case when the main assumption 
underpinning the k-NN decision rule, i.e. that the aposteriori class probability 
$P(\omega|x)$ at all k Nearest Neighbours is the same, breaks down. Thus the 
focus of our experimentation is on the range between these two limiting values. 

We measure the performance of the four methods of creating bootstrap sets 
for two types of learning sets. In the first case the learning set is created by 
randomly taking samples from the full data set. This results in a set that may 
contain samples from all classes with a population bias towards a certain class. 
The second type of learning set is referred to as a modified learning set. It 
is constructed using Method 3 which was mentioned as a technique to create 
unbiased bootstrap sets at the beginning of Section 4. This results in a set that 
is representative of all the classes, with the class population ratios similar to 
those of the full set. The modified learning set simulates an unbiased sample 
space. 

All experiments are repeated 100 times and the error rates are averaged over 
the repetitions. We compare the moderation results with the results obtained 
using non-moderated k-NN experts. We also compare the moderation results with 
results obtained using the modified product fusion strategy proposed in [2]. 

To find the misclassification error rate of a single expert, a test sample is 
presented to the k-NN classifier. The class posterior probabilities for each test 
sample are estimated in the non-moderated and moderated case using the raw and 
moderated formulas of Section 3, respectively. 

The test sample is assigned a class label that corresponds to the largest 
posterior probability. If the original label of the sample is found to be different 
from the assigned label the error counter is incremented. This is repeated for 
all samples in the test set. After presenting all test samples the error counter is 
divided by the number of test samples used, in order to get the misclassification 
rate. 
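The error counting procedure just described can be sketched as follows. This is a minimal illustration that assumes Euclidean distance and breaks posterior ties in favour of the class listed first; the helper names and the toy data are hypothetical.

```python
from collections import Counter


def knn_posteriors(train, x, k, classes, moderated=False):
    """Estimate class posteriors at x from its k nearest training samples.

    train is a list of (feature_tuple, label) pairs.
    """
    def dist2(sample):
        return sum((a - b) ** 2 for a, b in zip(sample[0], x))

    nearest = sorted(train, key=dist2)[:k]
    counts = Counter(label for _, label in nearest)
    if moderated:
        m = len(classes)
        return {c: (counts[c] + 1) / (k + m) for c in classes}
    return {c: counts[c] / k for c in classes}


def misclassification_rate(train, test, k, classes, moderated=False):
    """Fraction of test samples whose assigned label differs from the truth."""
    errors = 0
    for x, true_label in test:
        post = knn_posteriors(train, x, k, classes, moderated)
        assigned = max(classes, key=lambda c: post[c])
        if assigned != true_label:
            errors += 1
    return errors / len(test)


# Hypothetical one-dimensional data, two classes.
train = [((0.0,), 0), ((0.1,), 0), ((1.0,), 1), ((1.1,), 1)]
test = [((0.05,), 0), ((1.05,), 1), ((0.2,), 1)]
rate = misclassification_rate(train, test, k=1, classes=[0, 1])
```

For a single expert the raw and moderated outputs rank the classes identically, so the two error rates coincide; moderation only matters once outputs from several experts are multiplied.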

In order to see the effect of Moderation we adopt a simple relative performance 
measure defined in equation (9). It relates the difference between two results, 
for example results obtained using moderated and non-moderated systems, to the 
classification rate of the latter. In this way we will be able to reflect any 
improvements or degradations in performance even if the baseline classification 
rates are quite high. The relative performance measure s is given as 

$$s = \frac{c_m - c_n}{c_n} \times 100 \tag{9}$$

where $c_m$ is the moderated classification rate, or the new result, and $c_n$ 
is the unmoderated classification rate, or the baseline result. If the 
improvement or degradation exceeds 5% we consider it significant. This value is 
calculated for all four bagging methods and each of the two learning set types 
under varying set sizes. 
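As a worked illustration, the relative performance measure and the 5% significance rule can be written as two small functions. The function names and the example rates are hypothetical and are not results from the experiments.

```python
def relative_performance(c_m, c_n):
    """Relative change (in percent) of the new rate c_m over the baseline c_n."""
    return (c_m - c_n) / c_n * 100.0


def is_significant(s, threshold=5.0):
    """An improvement or degradation counts as significant above 5 percent."""
    return abs(s) > threshold


# Hypothetical rates: baseline 80.0, new result 88.0.
s = relative_performance(88.0, 80.0)
```

Because the difference is divided by the baseline rate, the same absolute gain yields a smaller s when the baseline is already high.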

Both synthetic and real data sets were used in our experiments. Synthetic 
data was chosen to carry out controlled experiments for which the achievable 
recognition rate is known. The computer generated data is two dimensional 
involving two classes. The two class densities have an overlap area which was 
designed to achieve an instability of the class boundary. The theoretical Bayes 
error of this data set is 6.67%. Most of the real data used were the standard 
sets obtained from the UCI repository. The exception is the seismic data 
set made available by Shell. Table 1 summarises the essential information about 
these data sets. 



Table 1. Data sets used and the number of samples available in each data set. 

Data Name     No. of samples   No. of features   No. of classes
Synthetic          1231               2                 2
Seismic             300              25                 3
Wine                178              13                 3
BCW                 699               9                 2
Ionosphere          351              34                 2
Iris                150               4                 3

5 Results 

Table 2 displays the baseline classification rates of the single expert and bagging 
using the product rule with raw aposteriori class probability estimates. These 
results have been analysed and discussed in detail elsewhere [3]. For the purpose 
of this paper we only note the main points, namely that 

— bagging does not improve the k-NN rule performance for sufficiently large 
training sets (in excess of 80 samples) 

— for smaller sample sets, bootstrapping and aggregation of moderated esti- 
mates of aposteriori class probabilities via product can be useful 

— for very small training and bootstrap sets created by means of regular sam- 
pling, bagging can lead to degradation in performance 

The benefits of moderating the k-NN classifier outputs can be gleaned from 
Figures 1(a) and 1(b), which plot the relative performance measure defined in 
(9). In Figure 1(a) we show the improvement gained over the single expert, 
whereas Figure 1(b) relates the product aggregation of moderated k-NN outputs to 
the product of raw outputs. We note that for training sets of size less than 80 
samples the performance improves significantly. The gains are inversely 
proportional to the training set size. Bagging with moderation can largely 
compensate for the lack of training data. 

Table 2. Classification rate (%) of the single expert and bagging using Product 
of non-moderated aposteriori class probability estimates 

Data   Learn     Bagging        Learning samples         Data   Learn     Bagging        Learning samples
set    set type  method         20      40      80       set    set type  method         20      40      80
Seis.  Regular   1              89.87   98.16   99.17    BCW    Regular   1              90.05   94.63   95.35
                 2              95.29   98.12   99.13                     2              91.69   94.68   95.35
                 3              96.56   98.62   99.16                     3              92.70   94.78   95.44
                 4              94.73   98.13   99.17                     4              91.43   94.62   95.42
                 Single expert  96.41   98.08   99.16                     Single expert  92.66   94.47   95.44
       Modified  1              94.83   98.03   99.08           Modified  1              90.89   94.79   95.38
                 2              96.41   98.16   99.09                     2              92.01   94.86   95.41
                 3              96.34   98.37   99.15                     3              92.43   94.88   95.49
                 4              95.50   97.96   99.10                     4              91.50   94.76   95.42
                 Single expert  96.90   97.91   99.10                     Single expert  93.32   94.65   95.48
Wine   Regular   1              76.05   91.39   93.54    Iris   Regular   1              81.42   93.46   96.13
                 2              83.20   92.09   93.65                     2              87.00   93.71   95.93
                 3              86.07   92.37   93.84                     3              88.76   94.30   96.06
                 4              81.04   91.80   93.78                     4              84.33   93.05   96.11
                 Single expert  85.14   91.75   93.56                     Single expert  91.35   94.07   96.27
       Modified  1              79.54   91.22   94.28           Modified  1              83.45   93.78   95.86
                 2              85.84   91.74   93.95                     2              88.75   94.46   95.96
                 3              86.43   92.01   94.26                     3              89.22   94.68   96.03
                 4              82.72   91.09   94.16                     4              84.52   93.59   95.93
                 Single expert  88.27   91.34   94.02                     Single expert  92.15   94.54   96.13
Iono.  Regular   1              68.28   71.97   76.00    Synth  Regular   1              88.76   91.12   92.28
                 2              68.24   72.63   75.86                     2              89.23   91.07   92.27
                 3              68.31   72.99   75.62                     3              89.85   91.52   92.36
                 4              67.40   72.31   75.85                     4              88.72   91.30   92.29
                 Single expert  68.60   70.25   75.93                     Single expert  89.34   91.02   92.28
       Modified  1              67.21   71.76   75.62           Modified  1              89.35   91.54   92.23
                 2              68.10   71.90   75.56                     2              90.03   91.86   92.27
                 3              67.68   72.36   76.15                     3              90.06   91.91   92.42
                 4              66.97   71.70   75.48                     4              89.32   91.58   92.28
                 Single expert  67.55   70.22   75.61                     Single expert  89.97   91.59   92.30

The method for moderating the outputs of the k-NN classifier advocated 
in this paper is based on the principles of sampling with a Bayesian prior. The 
Modified Product (MProduct) proposed in [2] has the same motivation, i.e. 
eliminating the veto effect of the product fusion rule that can be caused by raw 
aposteriori class probability estimates. However, MProduct is heuristic and it 
is of interest to compare its performance with the moderated output scheme 
based on theoretical foundations. In MProduct the expert output 
$P_j(\omega_i|x)$, which estimates the a posteriori probability for class 
$\omega_i$ given pattern vector x, is modified before entering the product 
fusion rule as follows: 

$$P_j(\omega_i|x) = \begin{cases} t & \text{if } P_j(\omega_i|x) \le t \\ P_j(\omega_i|x) & \text{if } P_j(\omega_i|x) > t \end{cases} \tag{10}$$
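The thresholding rule of MProduct amounts to clipping each expert output from below before taking the product. The following sketch is an illustration only: the function name `mproduct`, the threshold value 0.05 and the expert outputs are hypothetical, since the text does not state the value of t that was used.

```python
from math import prod


def mproduct(expert_outputs, t):
    """Product fusion after replacing every output below t by t."""
    clipped = [[max(p, t) for p in expert] for expert in expert_outputs]
    n_classes = len(clipped[0])
    fused = [prod(expert[j] for expert in clipped) for j in range(n_classes)]
    winner = max(range(n_classes), key=lambda j: fused[j])
    return winner, fused


# Same hypothetical situation as a hard veto: the third expert outputs zero
# for class 0, but after clipping at t = 0.05 class 0 is no longer vetoed.
outputs = [[0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
winner, fused = mproduct(outputs, t=0.05)
```

Unlike moderation, the clipped outputs of one expert need not sum to one, which does not matter for the arg-max decision but is one sense in which the rule is heuristic.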




Fig. 1. (a) Moderated Product: significant improvement over the single expert. 
(b) Moderated Product: significant improvement over the unmoderated Product. 
Numbers 1 to 4 on the x-axis represent bagging methods 1 to 4 using the regular 
learning set, while numbers 5 to 8 represent the same bagging methods using the 
modified learning set. 



The respective transfer functions between the raw inputs and outputs delivered 
by moderation and MProduct are shown in Figure 2. The posterior probability 
estimates of MProduct cover almost the full range 0 to 1, regardless of the 
value of k. In contrast, the range of moderated posterior probability estimates 
reduces as k decreases. However, these differences in transfer functions do not 
seem to translate to any significant differences in performance. In all the above 
experiments the value of k for a particular training set size was automatically 
chosen using the usual square root rule, i.e. $k = \sqrt{N}$ where N is the size 
of the training set. This leads to using an odd value for k at the largest set 
size and even k for the smaller two sizes. We wanted to check the consistency of 
the above results as k varies. We repeated the same experiments for the number 
of Nearest Neighbours varying from 2 to N. Typical results are shown in 
Figure 3. Figure 3(a) gives the classification error rates for the Seismic data 
for a single expert, and a product fusion combining raw, moderated and 
heuristically modified k-NN classifier outputs respectively. Note that the 
performance of Product of k-NN raw outputs improves with increasing k. Clearly 
the probability of the veto effect occurring will be highest for the smallest k 
and will go down as k increases. However, the improvement in performance of this 
fusion method is undermined by the general downward trend of the k-NN classifier 
as a function of k for small training sets. Thus the performance curve of 
Product of raw outputs will peak at some point and then monotonically decay with 
increasing k. 

Fig. 2. Comparison of posterior probability estimates when using Moderation and 
when using Modified Product (x-axis: number of nearest neighbours to class i) 
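The square root rule and the range of the moderated outputs can be tabulated for the three training set sizes. The sketch assumes that the square root is rounded to the nearest integer, which is the only common rounding convention that yields an odd k for 80 samples and even k for 20 and 40 samples; the function names are hypothetical.

```python
from math import sqrt


def square_root_k(n_train):
    """Number of neighbours by the square root rule, rounded to nearest."""
    return int(round(sqrt(n_train)))


def moderated_range(k, m=2):
    """Smallest and largest moderated outputs, reached at kappa = 0 and k."""
    return 1 / (k + m), (k + 1) / (k + m)


for n in (20, 40, 80):
    k = square_root_k(n)
    low, high = moderated_range(k)
    print(f"N={n:3d}  k={k}  moderated outputs in [{low:.3f}, {high:.3f}]")
```

The interval widens towards [0, 1] as k grows, whereas a thresholded output is limited from below only by t, whatever the value of k.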




Fig. 3. Comparison of classification rates of Product when expert outputs are raw, 
Moderated or Modified, all compared to the single expert. Results are for bagging 
method four using regular learning set for Seismic data (top row), and Wine data 
(bottom row). Panels: 20, 40 and 80 training samples. x-axis: number of nearest 
neighbours, k. Curves: Learning, Product, Moderated Prod., Modified Prod. 

The product of moderated outputs and MProduct peak at the lowest value 
of k, i.e. at k = 2, and then monotonically fall off as k increases. On average, 
MProduct is marginally better than moderation, but at the peak (k = 2) the 
curves meet. Interestingly, the single expert does extremely well for odd numbers 
of k and larger sets, but less well for smaller training sets. Most importantly, it 
does not do well for k = 2 where the probability of indecision is high. Thus one 
of the benefits of bagging is that it produces consistent performance for odd and 
even values of k. 

6 Conclusions 

The veto effect caused by contradicting experts outputting zero probability 
estimates leads to fusion strategies performing suboptimally. This can be 
resolved using Moderation. The Moderation formula has been derived for the k-NN 
classifier using a Bayesian prior. The merits of moderation are examined on real 
data sets. Tests with different bagging methods indicate that the proposed 
moderation method improves the performance of Product significantly, especially 
when the size of the training set leads to severe sampling problems. 

Abstract. We introduce an algorithm for incrementally constructing a 
hybrid network of radial and perceptron hidden units. The algorithm 
determines if a radial or a perceptron unit is required at a given region 
of input space. Given an error target, the algorithm also determines the 
number of hidden units. This results in a final architecture which is 
often much smaller than an RBF network or an MLP. A benchmark on 
four classification problems and three regression problems is given. The 
most striking performance improvement is achieved on the vowel data 
set. 



1 Introduction 

The construction of a network architecture which contains units of different 
types at the same hidden layer is not commonly done. One reason is that such 
construction makes model selection more challenging, as it requires the 
determination of each unit type in addition to the determination of network 
size. A more common approach to achieving higher architecture flexibility is via 
the use of more flexible units. The potential problem of such a construction is 
over flexibility, which leads to overfitting. 

We have previously introduced a training methodology for a hybrid MLP/RBF 
network. This architecture produced far better classification and regression 
results compared with advanced RBF methods or with MLP architectures. In this 
work, we further introduce a novel training methodology, which evaluates the 
need for additional hidden units, chooses optimally their nature - MLP or RBF - 
and determines their optimal initial weight values. The determination of 
additional hidden units is based on an incremental strategy which searches for 
regions in input space for which the input/output function approximation leads 
to the highest residual (error in case of classification). This approach, 
coupled with optimal determination of initial weight values for the additional 
hidden units, constructs 

* This work was partially supported by the Israeli Ministry of Science and by the 
Israel Academy of Sciences and Humanities - Center of Excellence Program. Part of 
this work was done while N. I. was affiliated with the Institute for Brain and Neural 
Systems at Brown University and supported in part by ONR grants N00014-98-1-0663 
and N00014-99-1-0009. 

J. Kittler and F. Roli (Eds.): MCS 2001, LNCS 2096, pp. 440 ff., 2001. 

© Springer-Verlag Berlin Heidelberg 2001 
a computationally efficient training algorithm which appears to scale up with the 
complexity of the data better than regular MLP or RBF methods. 



2 Motivation for Incremental Methods and the Use of 
Hybrid MLP/RBF Networks 

There are many ways to decompose a function into a set of basis functions. 
The challenging task is to use a complete set which converges fast to the desired 
function (chosen from a sufficiently wide family of functions). For example, while 
it is well known that an MLP with as little as a single hidden layer is a universal 
approximator, namely, it can approximate any function, it is also known that 
the approximation may be very greedy, namely the number of hidden units may 
grow very large as a function of the desired approximation error. 

Analysing cases where convergence of the architecture (as a function of the 
number of hidden units) is slow reveals that often there is at least one region 
in input space where an attempt is being made to approximate a function that is 
radially symmetric (such as a donut) with projection units, or vice versa. This 
suggests that an incremental architecture which chooses the appropriate hidden 
unit for different regions in input space can lead to a far smaller 
architecture. Earlier approaches attempted to construct a small network 
approximation to the desired function at different regions of input space. This 
approach, which was called "divide and conquer", has been studied since the 
eighties in the machine learning and connectionist communities. Rather than 
reviewing the vast literature on that, we shall point out some approaches which 
indicate some of the highlights that motivated our work. In work on trees, the 
goal is to reach a good division of the input space and use a very simple 
architecture at the terminating nodes. That work suggested some criteria for 
splitting the input space and provided a cost complexity method for comparing 
the performance of architectures of different sizes. An approach which 
constructs more sophisticated architectures at the terminating nodes has also 
been proposed, in which a gating network performs the division of the input 
space and small neural networks perform the function approximation at each 
region separately. Nowlan's many experiments with such an architecture led him 
to the conclusion that it is better to have different types of architecture for 
the gating network and for the networks that perform the function approximation 
at the different regions. He suggested using RBF for the gating network and MLP 
for the function approximation, and thus constructed the first hybrid 
architecture between MLP and RBF. A tree approach with neural networks as 
terminating nodes has also been proposed. The boosting algorithm is another 
variant of the space division approach, where the division is done based on the 
error performance of the given architecture. In contrast to previous work, this 
approach takes into account the geometric structure of the input data only 
indirectly. A remote family of architectures where the function approximation is 
constructed incrementally is projection pursuit and additive models. 






S. Cohen and N. Intrator 



If one accepts the idea of constructing a local simple architecture for 
different regions in input space, then the question becomes which architecture 
family should be used. The local architecture should be as simple as possible in 
order to avoid overfitting to the smaller portion of regional training data. 
Motivated by theoretical work that has studied the duality between 
projection-based approximation and radial kernel methods, we have decided to use 
RBF or perceptron units. Donoho's work has shown that a function can be 
decomposed into two parts, the radial part and the ridge (projection based) 
part, and that the two parts are mutually exclusive. It is difficult, however, 
to separate the radial portion of a function from its projection based portion 
before they are estimated, but a sequential approach which decides on the fly 
which unit to use for different regions in input space has the potential to find 
a useful subdivision. 

The most relevant statistical framework to our proposal is Generalized Additive 
Models (GAM). In that framework, the hidden units (the components of the 
additive model) have some parametric form, usually polynomial, which is 
estimated from the data. While this model has nice statistical properties, the 
additional degrees of freedom require strong regularization to avoid 
over-fitting. Higher order networks, a special case of GAM, have at least a 
quadratic term in addition to the linear term of the projections: 

$$f(x) = \sum_i w_i\, g\Big(\sum_j w_{ij} x_j + \sum_k \sum_l w_{ikl} x_k x_l + a_i\Big) + w_0 \tag{1}$$

While they present a powerful extension of MLPs, and can form local or global 
features, they do so at the cost of squaring the number of input weights to the 
hidden nodes. Flake has suggested an architecture similar to GAM where each 
hidden unit has a parametric activation function which can change from a 
projection based to a radial function in a continuous way. This architecture 
uses a squared activation function, thus called Squared MLP (SMLP), and only 
doubles the input dimension of the input patterns. 
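Why squared inputs allow a unit to move between projection and radial behaviour can be seen from a small identity: with suitable weights the pre-activation equals the negative squared distance to a centre. The sketch below illustrates that idea under our own parameterisation; it is not claimed to be the exact formulation of the Squared MLP, and the function names are hypothetical.

```python
def squared_unit_preactivation(x, w, v, a):
    """Pre-activation of a unit fed with both x_j and x_j squared:
    sum_j w_j * x_j + sum_j v_j * x_j**2 + a."""
    linear = sum(wj * xj for wj, xj in zip(w, x))
    quadratic = sum(vj * xj * xj for vj, xj in zip(v, x))
    return linear + quadratic + a


def radial_weights(centre):
    """Weights for which the pre-activation equals -||x - centre||^2,
    using -||x - c||^2 = 2 c.x - sum_j x_j^2 - ||c||^2."""
    w = [2.0 * c for c in centre]
    v = [-1.0] * len(centre)
    a = -sum(c * c for c in centre)
    return w, v, a


centre = (0.5, -1.0)
x = (1.0, 2.0)
w, v, a = radial_weights(centre)
value = squared_unit_preactivation(x, w, v, a)
neg_dist2 = -sum((xi - ci) ** 2 for xi, ci in zip(x, centre))
```

Setting all the quadratic weights to zero recovers an ordinary projection unit, so the same unit type covers both cases at the price of twice as many input weights.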

Our proposed hybrid extends both MLP and RBF networks by combining 
RBF and Perceptron units in the same hidden layer. Unlike the previously 
described methods, this does not increase the number of parameters in the model, 
at the cost of predetermining the number of RBF and Perceptron units in the 
network. The hybrid network is especially useful in cases where the data 
includes some regions that contain hill-plateau and other regions that contain 
Gaussian bumps, as demonstrated in Figure 1. The hybrid architecture, which we 
call Perceptron Radial Basis Net (PRBFN), automatically finds the relevant 
functional parts from the data concurrently, thus avoiding possible local minima 
that result from sequential methods. The first training step in the previous 
approach was to cluster the data. In the next step, we tested two hypotheses for 
each cluster. If a cluster was far from radial Gaussian we rejected this 
hypothesis and accepted the null hypothesis. However, we had to use a threshold 
for rejecting the normal distribution hypothesis. When it was decided that a 
data cluster is likely to be normal, an RBF unit was used and otherwise a 
Perceptron (projection) unit was used. The last step was to train the hybrid 
network with full gradient descent on the full parameters. 
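A hidden layer that mixes the two unit types can be sketched as a forward pass. The Gaussian form of the RBF unit, the logistic sigmoid of the perceptron unit and all names and parameter values below are assumptions made for illustration; the text does not fix these details.

```python
from math import exp


def rbf_unit(x, centre, width):
    """Gaussian radial unit: responds to the distance from a centre."""
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, centre))
    return exp(-d2 / (2.0 * width ** 2))


def perceptron_unit(x, weights, bias):
    """Projection unit: logistic sigmoid of a weighted sum of the inputs."""
    s = sum(wi * xi for wi, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + exp(-s))


def hybrid_output(x, rbf_units, perceptron_units, out_weights, out_bias):
    """Linear read-out over one hidden layer containing both unit types.

    rbf_units is a list of (centre, width) pairs and perceptron_units a
    list of (weights, bias) pairs; out_weights follows the same order.
    """
    hidden = [rbf_unit(x, c, s) for c, s in rbf_units]
    hidden += [perceptron_unit(x, w, b) for w, b in perceptron_units]
    return sum(w * h for w, h in zip(out_weights, hidden)) + out_bias


# Hypothetical network with one unit of each type.
y = hybrid_output((0.0, 0.0),
                  rbf_units=[((0.0, 0.0), 1.0)],
                  perceptron_units=[((1.0, 1.0), 0.0)],
                  out_weights=[2.0, 4.0],
                  out_bias=1.0)
```

Each hidden unit carries roughly the same number of parameters whichever type it is, which is why mixing the types does not inflate the model.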

Fig. 1. Data that is composed of five clusters and a sigmoidal surface. 



However, the selection based on a simple hypothesis test could be improved, 
and it suffered from an unclear way of estimating the hypothesis rejection 
threshold. Another problem with the old approach is that the number of hidden 
units has to be given to the algorithm in advance. In this paper we introduce a 
new algorithm that automatically selects the type of hidden units as well as the 
number of units. 

There are several approaches to setting the structure of a neural network. The 
first one is forward selection. This approach starts with a small network and 
adds units until an error goal is reached. Another approach is to start with a 
large network and prune unnecessary units until a given criterion is met. 

In this paper we use the first approach. We start with a small network and 
expand it until a given error goal is met. Thus, the algorithm determines the 
number of hidden units automatically. As noted above, a very difficult task in 
training a hybrid neural network is to find the radial and projection parts 
automatically. This problem is amplified for high dimensional data, where the 
data cannot be visualized very well. We propose a novel way to select the type 
of hidden unit automatically for regression and classification. This algorithm 
leads to smaller networks while maintaining good generalization of the resulting 
network. 

3 Parameter Estimation and Model Selection 

An incremental architecture with more than one type of building component 
requires three decisions at each step: (i) find the next region in input space 
where a hidden unit might be needed; (ii) decide which unit to add, an RBF or 
a perceptron; (iii) test whether the new unit is actually useful. 

The SMLP network uses both RBF and Perceptron units at each cluster. 
In higher order networks quadratic and linear terms always exist and strong 
regularization must be used to avoid over-fitting. 

We prefer to attempt to select the proper unit for each region in input space. 
Thus, the number of hidden units is minimal and over-fitting is reduced. In order 
to select the type of units in a high dimensional space, one has to divide the 
space into regions and then select the type of hidden unit for each region. 
During the division of the space into small regions we can estimate the overall 
error and stop the splitting process when an error goal is reached, thus 
monitoring the size of the network as well. For these reasons there are several 
steps in estimating the parameters and structure of the hybrid network. 

We outline the algorithm’s steps to achieve these goals as follows: 

— Data clustering and splitting to reduce an error objective function. 

— Automatic selection of unit type for each cluster. 

— Full gradient descent. 

In subsequent sections we describe each step of the algorithm in more detail. 

3.1 Data Clustering 

We start by clustering the data and reducing an objective error function on the 
data. The objective function can be the Entropy, for classification, or the Sum of 
Square Errors (SSE) for regression problems. The entropy serves as a measure of 
information or surprise. The less entropy a cluster has, the more uniform are 
the class tags of its patterns. Minimizing the SSE is equivalent to maximum 
likelihood estimation under a Gaussian assumption about the noise. Thus, 
reducing the SSE is equivalent to the maximization of the data likelihood. 
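The two objective functions can be sketched as follows. The base-2 logarithm and the absence of any weighting by cluster size are assumptions of this illustration, as the text does not specify them; the function names are hypothetical.

```python
from collections import Counter
from math import log2


def cluster_entropy(labels):
    """Entropy (in bits) of the class tags of the patterns in one cluster."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def cluster_sse(targets, predictions):
    """Sum of squared errors between targets and current predictions."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions))


pure = cluster_entropy(["a", "a", "a", "a"])
mixed = cluster_entropy(["a", "a", "b", "b"])
sse = cluster_sse([1.0, 2.0], [1.5, 2.0])
```

A cluster whose patterns all share one class tag has zero entropy, while an even two-class mixture has the maximum of one bit, so the cluster with the largest entropy is the most impure one.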

The algorithm described here splits the cluster with the largest objective 
function (error value) into two clusters as follows. We feel that more work is 
needed on the theoretical justification of the splitting rule. This is a subject 
for future work. 

— Start with the whole data as one cluster. 

— Find the cluster with the largest error. 

— For regression problems use regression split and otherwise classification split. 

— The splitting is continued until an error goal is reached or a maximum number 
of clusters is reached. 

The splitting for regression divides each cluster into two regions. One region 
is where the target function is approximated rather well, the other where 
there is still a large error. This is done as follows: 

— Find the pattern with the largest error in the above cluster. 

— Sort the patterns by distance to the pattern with the largest error value. 

— Split the current cluster by considering the (n − 1) ways to divide its n 
patterns at a cut point in the sorted list. 

— Choose the split with the largest reduction in the SSE. 
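The regression split above can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not say how the SSE of each candidate division is evaluated, so here each side is scored by its SSE around its own mean target (a constant fit), and the names `regression_split` and `sse_about_mean` are hypothetical.

```python
import math


def sse_about_mean(values):
    """SSE of a group around its own mean (assumed constant-fit proxy)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def regression_split(points, targets, errors):
    """Split one cluster around its worst-fitted pattern.

    points  -- list of coordinate tuples
    targets -- list of scalar target values
    errors  -- current absolute error of the model at each pattern
    Returns two index lists: patterns near the worst pattern, and the rest.
    """
    n = len(points)
    worst = max(range(n), key=lambda i: errors[i])
    # Sort pattern indices by distance to the pattern with the largest error.
    order = sorted(range(n), key=lambda i: math.dist(points[i], points[worst]))
    best_cut, best_sse = 1, float("inf")
    for cut in range(1, n):  # the n - 1 candidate divisions
        near = [targets[i] for i in order[:cut]]
        far = [targets[i] for i in order[cut:]]
        sse = sse_about_mean(near) + sse_about_mean(far)
        if sse < best_sse:
            best_cut, best_sse = cut, sse
    return order[:best_cut], order[best_cut:]
```

Since the SSE before the split is the same for every candidate, choosing the smallest post-split SSE is the same as choosing the largest reduction.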

The splitting procedure for classification problems minimizes the Entropy 
criterion. Thus, splitting the cluster with the largest Entropy splits the cluster 
with maximum impurity. Breiman [1] used the Gini criterion for splitting nodes 
in the CART algorithm and found that the two criteria are equivalent. 
We propose the following splitting procedure for a given cluster: 

— Select the two classes with the maximum number of patterns. 

— Compute their mean and form two new clusters of these means. 

— For each pattern associate it with the nearest mean and its cluster. 
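The three steps above can be sketched as follows. This is a minimal illustration with hypothetical names; it assumes Euclidean distance and breaks distance ties in favour of the first mean, neither of which is specified in the text.

```python
import math
from collections import Counter


def classification_split(points, labels):
    """Split a cluster around the means of its two most frequent classes.

    points -- list of coordinate tuples; labels -- class tag per pattern.
    Returns two lists of pattern indices.
    """
    top_two = [cls for cls, _ in Counter(labels).most_common(2)]
    if len(top_two) < 2:  # the cluster is already pure
        return list(range(len(points))), []
    means = []
    for cls in top_two:
        members = [p for p, lab in zip(points, labels) if lab == cls]
        dim = len(members[0])
        means.append(tuple(sum(m[d] for m in members) / len(members)
                           for d in range(dim)))
    first, second = [], []
    for i, p in enumerate(points):
        # Associate every pattern (of any class) with the nearest mean.
        if math.dist(p, means[0]) <= math.dist(p, means[1]):
            first.append(i)
        else:
            second.append(i)
    return first, second
```

Patterns from minority classes are not used to form the means, but they are still assigned to one of the two new clusters.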

The above splitting procedures are simple and there is no need to work 
through every coordinate as done in the CART [1] algorithm. 



3.2 Unit Selection 



Several authors have used the Bayesian approach for model selection. MacKay [18] 
uses the evidence to select a model out of a set of models. Kass and Raftery [16] 
consider the Bayes Factors for given models $M_1$, $M_2$ and data set $D$. We follow 
this approach here by starting with the Bayesian formulation which computes 
the most probable model by integrating over the model parameters: 



$$p(D \mid M) = \int_{W} p(D, w \mid M)\,dw = \int_{W} p(D \mid w, M)\,p(w \mid M)\,dw. \qquad (2)$$



The Bayes Factors are then defined as: 

$$\frac{p(M_1 \mid D)}{p(M_2 \mid D)} = \frac{p(D \mid M_1)\,p(M_1)}{p(D \mid M_2)\,p(M_2)}. \qquad (3)$$

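The integration of (2) can be performed by using the Laplace approximation, 
that is, by approximating the logarithm of the integrand by a quadratic function 
around its maximum. Thus, the value of the integral becomes: 

$$p(D \mid M) \approx (2\pi)^{k/2}\,|H|^{-1/2}\,p(D \mid w_{m_0}, M)\,p(w_{m_0} \mid M), \qquad (4)$$

with $k$ the number of parameters in $w$, 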

where $H$ is the Hessian matrix of the approximation and $w_{m_0}$ is the most 
probable value of the parameters. Note that this calculation takes into 
account the performance of the model in the vicinity of the parameter vector 
$w_{m_0}$ and is thus much more informative than a simple likelihood at $w_{m_0}$. 
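A one-dimensional sketch may help to see what the Laplace approximation does. The code below is an illustration only: the function names are hypothetical, the curvature is obtained by finite differences, and the mode `w0` is assumed to be known. In one dimension the approximation reduces to the integrand at the mode times `sqrt(2*pi / h)`, where `h` is the negative second derivative of the log-integrand at the mode.

```python
import math


def laplace_integral(log_f, w0, step=1e-3):
    """Laplace approximation to the integral of exp(log_f(w)) over the real line.

    log_f -- log of the (unnormalized) integrand
    w0    -- location of its maximum (assumed known here)
    """
    second = (log_f(w0 + step) - 2.0 * log_f(w0) + log_f(w0 - step)) / step ** 2
    hessian = -second  # positive at a maximum
    return math.exp(log_f(w0)) * math.sqrt(2.0 * math.pi / hessian)


def riemann_integral(log_f, lo, hi, steps=20000):
    """Midpoint-rule reference value for comparison."""
    width = (hi - lo) / steps
    return width * sum(math.exp(log_f(lo + (k + 0.5) * width))
                       for k in range(steps))
```

For a Gaussian-shaped integrand the log is exactly quadratic, so the approximation agrees with direct numerical integration; for other integrands it captures the mass around the mode only.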

In the absence of a priori knowledge we assume that a model with an RBF 
or a perceptron as a hidden unit is equally likely, thus: 

$$p(M_1) = p(M_2).$$

This leads to the likelihood ratio: 

$$\frac{p(D \mid M_1)}{p(D \mid M_2)}.$$

The proposed algorithm selects the type of hidden unit automatically by 
using the likelihood ratio. The maximum likelihood is computed for each cluster 
and unit type. The unit type with the higher likelihood is selected. The maximum 
likelihood is defined differently for classification and regression problems, as 
described in the next sub-sections. 



Regression unit type selection. In the regression context the maximum 
likelihood is defined as follows: 

$$L = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(t_n - y_n)^2}{2\sigma^2}\right), \qquad (5)$$

where $t_n$ are the target values and $y_n$ is the output of the neural network. 
Maximization of the likelihood is equivalent to minimization of the sum of squares 
if one assumes that the function values are corrupted by noise that is normally 
distributed with a given variance $\sigma^2$. This assumption is plausible when the 
noise is a sum of signals from independent sources, according to the Central Limit 
Theorem. 

As noted above, the selection of the unit type is done for each cluster and 
the above computation is repeated for each cluster. 

To decide between the two possible units, we project the data using either 
an RBF or a ridge, and thus consider two 1-D data fitting problems. The ridge 
projection is monotonically increasing with the correlation between its weight 
vector and the data points. It achieves its maximum value when the correlation is 
maximized (for a unit projection vector). Therefore, the weight vector should be 
proportional to the average over all patterns in the cluster (there is a strong 
assumption here about the lack of effect of the sigmoidal term on the optimization, 
but remember that this is only used to choose optimal initial conditions and to 
perform model selection). 

The RBF function, on the other hand, is monotonically decreasing with the 
distance from the maximum point. Thus, the center of the RBF is located at the 
maximum point of the function. Selection of the value that maximizes the likelihood 
in this case is trivial. Finally, the unit type with the higher likelihood is selected 
for the current cluster. 

The above calculation did not take into account the sign of the forward 
parameter that connects the hidden unit to the output layer. In order to take it 
into account, we need to calculate the values that maximize and minimize the 
likelihood for each unit, then calculate the sign of the forward connection 
(using the simple inverse procedure of the hidden unit activity matrix) and 
choose the values which are consistent with the sign of the forward connection. 
In other words, if the forward connection turns out to be positive, the value 
which maximized the likelihood should be chosen, and vice versa. 
Thus, the above procedure is repeated with the above transformation on each 
of the target function values. Finally, the most probable unit type of both cases 
is chosen. 

Classification unit type selection. In classification, the target function has 
multiple values for each pattern. Thus, the previous technique cannot be directly 
applied. For simplicity, we assume that the clustering performed in the previous 
step has purified the clusters, namely, that each cluster mostly contains 
patterns from a single class. Under this assumption, it is reasonable to use 
the likelihood of the data points within the cluster as an indication of the fit of 
different unit types. Note that this assumption may be too strong, especially 
in the early stages of incrementally constructing an architecture. Relaxing this 
assumption is a subject for future work. 

To derive the most probable value of the weights, a linear approximation for 
the weights of the ridge function is used. Thus, the ridge function is approximated 
by $w^T x$. Hence, we wish to maximize the scalar product of $w$ and the sum of 
the patterns in the current cluster. Since this product grows without bound with 
the norm of $w$, a Lagrange multiplier $\alpha$ is introduced to enforce the 
following constraint: 

$$w^T w = 1,$$

arriving at the following objective function: 

$$L(w, \alpha) = \sum_{i=1}^{N} w^T x_i + \alpha\,(w^T w - 1), \qquad (6)$$

where $N$ is the number of patterns in the current cluster. The partial derivative 
with respect to the weight vector is: 

$$\frac{\partial L}{\partial w} = \sum_{i=1}^{N} x_i + 2\alpha w, \qquad (7)$$

and the partial derivative with respect to $\alpha$ is 

$$\frac{\partial L}{\partial \alpha} = w^T w - 1. \qquad (8)$$

For convenience, let $Z = \sum_{i=1}^{N} x_i$. Setting Equation (7) to zero gives: 

$$Z = -2\alpha w.$$

Squaring both sides and using the constraint $w^T w = 1$ gives: 

$$\|Z\|^2 = 4\alpha^2.$$

Thus, we obtain: 

$$2\alpha = \pm\|Z\|,$$

or, 

$$w = \mp\frac{Z}{\|Z\|}. \qquad (9)$$

The Hessian, which is derived from Equation (7), provides the correct sign of $w$ 
and ensures the maximization procedure: 

$$\frac{\partial^2 L}{\partial w^2} = 2\alpha I. \qquad (10)$$

Thus, the Hessian is a diagonal matrix, and it is negative definite when $\alpha$ is 
negative, leading to setting $w$ as follows: 

$$w = \frac{Z}{\|Z\|}. \qquad (11)$$
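The result of this derivation, a unit weight vector pointing along the sum of the cluster's patterns, can be checked numerically. The sketch below is illustrative only; `ridge_direction` and `summed_projection` are hypothetical names, and the case of a zero pattern sum (where no direction is preferred) is simply rejected.

```python
import math


def ridge_direction(patterns):
    """Unit vector w = Z / ||Z||, with Z the sum of the cluster's patterns."""
    dim = len(patterns[0])
    z = [sum(p[d] for p in patterns) for d in range(dim)]
    norm = math.sqrt(sum(c * c for c in z))
    if norm == 0.0:
        raise ValueError("patterns sum to zero; the direction is undefined")
    return [c / norm for c in z]


def summed_projection(w, patterns):
    """Sum over the cluster of the scalar products between w and each pattern."""
    return sum(sum(wd * pd for wd, pd in zip(w, p)) for p in patterns)
```

For the patterns (1, 0), (2, 1) and (0, 3) the sum is Z = (3, 4), so w = (0.6, 0.8) and the summed projection equals the norm of Z, which is 5; the axis-aligned unit vectors give only 3 and 4.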



The response of an RBF unit depends on the distance of patterns from 
its center. Thus, we seek to minimize the sum of squared distances of the patterns 
from an unknown vector. Let us define the following objective function: 

$$L(m) = \sum_{i=1}^{N} \|x_i - m\|^2, \qquad (12)$$

where $x_i$ is a pattern vector in the current cluster and $m$ is an unknown vector. 
The partial derivative with respect to $m$ is 

$$\frac{\partial L}{\partial m} = -2 \sum_{i=1}^{N} (x_i - m). \qquad (13)$$

Equating to zero we arrive at: 

$$m = \frac{1}{N} \sum_{i=1}^{N} x_i. \qquad (14)$$
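That the cluster mean minimizes the sum of squared distances can be verified with a few lines of code. This is a small sketch with hypothetical function names, not part of the original algorithm description.

```python
def rbf_center(patterns):
    """Mean of the cluster's patterns, used as the RBF center."""
    n = len(patterns)
    dim = len(patterns[0])
    return tuple(sum(p[d] for p in patterns) / n for d in range(dim))


def sum_squared_distances(m, patterns):
    """Objective L(m): sum of squared Euclidean distances to m."""
    return sum(sum((pd - md) ** 2 for pd, md in zip(p, m)) for p in patterns)
```

For the patterns (0, 0), (2, 0) and (1, 3) the center is (1, 1), and moving the center away from it in any direction increases the objective.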

Thus, we have the most probable values of the center of the RBF and of the weight 
vector of the ridge function. We now compute the likelihood of the cluster data for 
the two models and select the most likely model. The likelihood is defined as 

$$p(D \mid M) = \prod_{i=1}^{N} p(x_i \mid C), \qquad (15)$$

where 

$$p(x_i \mid C) = \frac{1}{1 + \exp(-w^T x_i)}$$

for the ridge function, and 

$$p(x_i \mid C) = \exp\left(-\frac{\|x_i - m\|^2}{2\sigma^2}\right)$$

for the RBF function. However, since the ridge function is an improper probability 
density function, that is: 

$$\int_{-\infty}^{\infty} p(x \mid C)\,dx \neq 1,$$

we normalize it by the factor: 

$$\sum_{i=1}^{N} (\cdots)$$

where $N$ is the number of patterns in the data set. 



4 Experimental Results 

This section describes regression and classification results of several variants of 
RBF and the proposed PRBFN architecture on several data sets. Since this paper 
extends our first extension into a hybrid MLP/RBF network [3], we shall denote 
the architecture resulting from the previous work as PRBFN and results from 
the construction presented in this work as PRBFN2. The following results are 
given on the test portion of each data set (full details are in [3]). They 
are an average over 100 runs and include the standard error. 



4.1 Regression 



The LogGaus data set is a composition of one ridge function and three 
Gaussians, as follows: 

$$f(x) = \frac{1}{1 + \exp(-w^T x)} + \sum_{i=1}^{3} \exp\left(-\frac{\|x - m_i\|^2}{2\sigma^2}\right),$$

where $w = (1, 1)$, the centers $m_i$ of the Gaussian functions are at 
$(1, 1)$, $(1, -5)$, $(-4, -2)$ and $\sigma = 1$. Normally distributed random noise 
with zero mean and 0.1 variance is added to the function. The whole data set is 
composed of 441 points and is divided randomly into two sets of 221 and 220 points. 
The first set serves as the training set and the second one as the test set. 
None of the other regressors that we tested revealed the true structure of the 
data; only PRBFN2 revealed the three Gaussians and the ridge function. This 
is reflected in the results on this data set. Thus, we make the observation 
that PRBFN has high performance when the data is composed of ridge and 
Gaussian parts. If the data is composed of either Gaussians or a ridge function 
alone, it can reach the performance of other regressors. 

The second data set is a 2D sine wave, 

$$y = 0.\ \sin(x_1/4)\,\sin(x_2/2),$$

with 200 training patterns sampled at random from an input range $x_1 \in [0, 10]$ 
and $x_2 \in [-5, 5]$. The clean data was corrupted by additive Gaussian noise with 
$\sigma = 0.1$. The test set contains 400 noiseless samples arranged as a 20 by 20 grid 
pattern, covering the same input ranges. Orr measured the error as the total 
squared error over the 400 samples. We follow Orr and report the error as the 
SSE on the test set. 

The third data set is a simulated alternating current circuit with four input 
dimensions (resistance $R$, frequency $\omega$, inductance $L$ and capacitance $C$) and 
one output, the impedance 

$$Z = \sqrt{R^2 + \left(\omega L - \frac{1}{\omega C}\right)^2}.$$

Each training set contained 200 points sampled at random from a certain region 
[see [21] for further details]. Again, additive noise was added to the outputs. 
The experimental design is the same as the one used by Friedman in the evaluation 
of MARS [9]. Friedman’s results include a division by the variance of the test set 
targets. We follow Friedman and report the normalized MSE on the test set. Orr’s 
regression trees method [21] outperforms the other methods on this data set. 
However, the PRBFN neural network achieves similar results to Orr’s method. 
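The error measure used for this data set, the MSE divided by the variance of the test targets, can be sketched as follows. The code is illustrative only: the function names are hypothetical, the variance is the population variance of the targets, and `impedance` uses the standard series-circuit formula as an assumption about the simulated target.

```python
import math


def impedance(r, omega, l, c):
    """Impedance magnitude of a series RLC circuit (assumed target function)."""
    return math.sqrt(r ** 2 + (omega * l - 1.0 / (omega * c)) ** 2)


def normalized_mse(targets, predictions):
    """Mean squared error divided by the variance of the targets."""
    n = len(targets)
    mse = sum((t - y) ** 2 for t, y in zip(targets, predictions)) / n
    mean_t = sum(targets) / n
    var_t = sum((t - mean_t) ** 2 for t in targets) / n
    return mse / var_t
```

Under this normalization a model that always predicts the mean of the test targets scores 1, so values well below 1 indicate that the model explains most of the target variance.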



Table 1. Comparison of mean squared error results on three data sets (see [3] for 
details). Results on the test set are given for several variants of RBF networks which 
were also used by Orr to assess RBFs. MSE results are an average over 100 runs and 
include the standard deviation. 

              LogGauss      2D Sine       Friedman 
Rbf-Orr       0.02 ± 0.14   0.91 ± 0.19   0.12 ± 0.03 
Rbf-Matlab    -             0.74 ± 0.4    0.2 ± 0.03 
Rbf-Bishop    0.02 ± 0.02   0.53 ± 0.19   0.18 ± 0.02 
PRBFN         0.02 ± 0.02   0.53 ± 0.19   0.15 ± 0.03 
PRBFN2        0.01 ± 0.01   0.49 ± 0.23   0.118 ± 0.03 


4.2 Classification 

We have used several data sets to compare the classification performance of 
the proposed methods to other RBF networks. The sonar data set attempts to 
distinguish between a mine and a rock. It was used by Gorman and Sejnowski 
[22] in their study of the classification of sonar signals using neural networks. 
The data has 60 continuous inputs and one binary output for the two classes. It 
is divided into 104 training patterns and 104 test patterns. The task is to train 
a network to discriminate between sonar signals that are reflected from a metal 
cylinder and those that are reflected from a similarly shaped rock. There are no 
results for Bishop’s algorithm, as we were not able to get it to reduce the output 
error. Gorman and Sejnowski [22] report results with feed-forward architectures 
using 12 hidden units. They achieved 90.4% correct classification on the test 
data with the angle-dependent task. This result outperforms the results obtained 
by the different RBF methods, and is only surpassed by the proposed hybrid 
RBF-FF network. 






The Deterding vowel recognition data [4] is a widely studied benchmark. 
This problem may be more indicative of the type of problems that a real neural 
network could be faced with. The data consists of auditory features of 
steady-state vowels spoken by British English speakers. There are 528 training 
patterns and 462 test patterns. Each pattern consists of 10 features and belongs 
to one of 11 classes that correspond to the spoken vowel. The speakers are of both 
genders. The best score so far was reported by Flake using his SMLP units. His 
average best score was 60.6% [8] and was achieved with 44 hidden units. Our 
algorithm achieved 68% correct classification with only 27 hidden units. As far as 
we know, it is the best result that was achieved on this data set. The waveform data set 
is a three-class problem which was constructed by Breiman to demonstrate the 
performance of the Classification and Regression Trees method [1]. Each class 
consists of a random convex combination of two out of three waveforms sampled 
discretely with added Gaussian noise. The data set contains 5000 instances, and 
300 are used for training. Recent reports on this data set can be found in [2, 13]. 
Each used a different training set size. We used the smaller training set size, as 
in [13], who report a best result of 19.1% error. The optimal Bayes classification 
rate is 86% accuracy, the CART decision tree algorithm achieved 72% accuracy, 
and the Nearest Neighbor algorithm achieved 38% accuracy. PRBFN has achieved 
85.8% accuracy on this data set. There is not much room for improvement over 
the PRBFN classifier in this example. 

Table 2 summarizes the percent correct classification results on the different 
data sets for the different RBF classifiers and the proposed hybrid architecture. 
As in the regression case, the STD is also given. However, on the seismic data, 
due to the use of a single test set (as we wanted to see the performance on this 
particular data set only), the STD is often zero, as only a single classification of 
the data was obtained in all 100 runs. 



Table 2. Percent classification results of different classifier variants on three data sets. 

Algorithm     Sonar        Vowel        Waveform 
RBF-Orr       71.7 ± 0.5   -            - 
RBF-Matlab    82.3 ± 2.4   51.6 ± 2.9   83.8 ± 0.2 
RBF-Bishop    -            48.4 ± 2.4   83.5 ± 0.2 
PRBFN         91 ± 2       67 ± 2       85.8 ± 0.2 
PRBFN2        91.3 ± 2     68 ± 2       85 ± 0.3 




The Protein data set poses a difficult classification problem since originally 
the number of instances of each class differs significantly. The input dimension 
of the patterns is 20 and there are 2255 patterns and two classes. The data set 
is divided into 1579 patterns in the training set and 676 patterns in the test set. 
The first class in the training set has only 340 instances and the second one has 
1239 instances. Thus, the a priori probability of the first class is 0.2153 while 
the a priori probability of the second class is 0.7847. To overcome this problem 
we re-sample the first class patterns with normally distributed noise with mean 
zero and 0.01 variance. This data was not used in [3]. 
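The re-sampling step can be sketched as follows. This is a minimal illustration with assumptions: the text does not state how many noisy copies are generated, so the target size is left as a parameter, and the name `oversample_with_noise` and the fixed random seed are hypothetical choices.

```python
import random


def oversample_with_noise(patterns, target_count, variance=0.01, seed=0):
    """Grow a minority class by adding jittered copies of its patterns.

    patterns     -- list of coordinate tuples from the minority class
    target_count -- desired number of patterns after re-sampling (assumed)
    variance     -- variance of the zero-mean Gaussian noise
    """
    rng = random.Random(seed)
    std = variance ** 0.5
    augmented = [tuple(p) for p in patterns]  # keep the original patterns
    while len(augmented) < target_count:
        base = rng.choice(patterns)
        augmented.append(tuple(x + rng.gauss(0.0, std) for x in base))
    return augmented
```

For example, `oversample_with_noise(class1_patterns, 1239)` would bring the first class up to the size of the second one in the training set, where `class1_patterns` stands for the 340 original patterns.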



Table 3. Percent classification results of different classifier variants on the Protein 
data set. 

Algorithm     OKProb class1   OKProb class2   Total OKProb 
RBF-Orr       -               -               - 
RBF-Matlab    -               -               - 
RBF-Bishop    75.18 ± 3       79.49 ± 1.5     78.6 ± 2 
PRBFN2        77.4 ± 5        80.11 ± 2       79.56 ± 3 




5 Discussion 

The work presented in this paper represents a major step in constructing an 
incremental hybrid architecture. It was motivated by the success of the original 
hybrid architecture which was introduced in [3]. Several assumptions were made 
in various parts of the architecture construction. Some of them are more justified 
and some require further refinement, which is the subject of future work. Our 
aim was to show that even under these assumptions, an architecture that is 
smaller in size and better in generalization performance can already be achieved. 
Furthermore, while this architecture is particularly useful when the data contain 
ridge and Gaussian parts, its performance was not below the performance of 
the best known MLP or RBF networks when data that contains only one type 
of structure was used. 

In previous work we used a hard threshold for unit type selection. The 
previous algorithm also required the number of hidden units in advance. This 
paper introduces an algorithm that automatically reveals the relevant parts of 
the data and maps these parts onto RBF or Ridge functions respectively. The 
algorithm also finds the number of hidden units for the network given only an 
error target. The automatic unit type detection uses the maximum likelihood 
principle in a different manner for regression and classification. In regression, the 
connection between the likelihood and the SSE has long been known and used. In the 
classification case the output target function is not continuous and not scalar. 
The ridge function is an improper probability density function (PDF) and a 
normalization is made to turn it into a PDF-like function. 

We have tested the new architecture construction on three regression prob- 
lems and four classification problems. There are three cases where better results 
were obtained. In the extensively studied vowel data set, the proposed hybrid ar- 
chitecture achieved average results which are superior to the best known results 
[t2;II| and uses a smaller number of hidden units. On the waveform classification 
problem Q, our results are close to the Bayes limit for the data and are better 
than the current known results. In the LogGaus data set, which is composed 
of Ridge and Gaussian parts (an excellent example for our hybrid), results 
were again improved with our proposed architecture construction. We are 
excited about the ability to better model extensively studied nonlinear data and, 
in particular, to demonstrate increased generalization while keeping the number 
of estimated parameters smaller. 

References 

1. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Re- 
gression Trees. The Wadsworth Statistics/Probability Series, Belmont, CA, 1984. 

2. J. Buckheit and D. L. Donoho. Improved linear discrimination using time-frequency 
dictionaries. Technical Report, Stanford University, 1995. 

3. S. Cohen and N. Intrator. A hybrid projection based and radial basis function 
architecture. In J. Kittler and F. Roli, editors, Proc. Int. Workshop on Multiple 
Classifier Systems (LNCS 1857), pages 147-156, Sardinia, June 2000. Springer. 

4. D. H. Deterding. Speaker Normalisation for Automatic Speech Recognition. PhD 
thesis. University of Cambridge, 1989. 

5. D. L. Donoho and I. M. Johnstone. Projection-based approximation and a duality 
with kernel methods. Annals of Statistics, 17:58-106, 1989. 

6. H. Drucker, R. Schapire, and P. Simard. Improving performance in neural networks 
using a boosting algorithm. In Steven J. Hanson, Jack D. Cowan, and C. Lee Giles, 
editors. Advances in Neural Information Processing Systems, volume 5, pages 42- 
49. Morgan Kaufmann, 1993. 

7. S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. 
CMU-CS-90-100, Carnegie Mellon University, 1990. 

8. G. W. Flake. Square unit augmented, radially extended, multilayer perceptrons. 
In G. B. Orr and K. Müller, editors, Neural Networks: Tricks of the Trade, pages 
145-163. Springer, 1998. 

9. J. H. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 
19:1-141, 1991. 

10. J. H. Friedman and W. Stuetzle. Projection pursuit regression. Journal of the 
American Statistical Association, 76:817-823, 1981. 

11. T. Hastie and R. Tibshirani. Generalized additive models. Statistical Science, 
1:297-318, 1986. 

12. T. Hastie and R. Tibshirani. Generalized Additive Models. Chapman and Hall, 
London, 1990. 

13. T. Hastie, R. Tibshirani, and A. Buja. Flexible discriminant analysis by optimal 
scoring. Journal of the American Statistical Association, 89:1255-1270, 1994. 

14. R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of 
local experts. Neural Computation, 3(1):79-87, 1991. 

15. M. I. Jordan and R. A. Jacobs. Hierarchies of adaptive experts. In J. E. Moody, S. J. 
Hanson, and R. P. Lippmann, editors. Advances in Neural Information Processing 
Systems, volume 4, pages 985-992. Morgan Kaufmann, San Mateo, CA, 1992. 

16. R. E. Kass and A. E. Raftery. Bayes factors. Journal of The American Statistical 
Association, 90:773-795, 1995. 

17. Y. C. Lee, G. Doolen, H. H. Chen, G. Z. Sun, T. Maxwell, H. Y. Lee, and C. L. Giles. 
Machine learning using higher order correlation networks. Physica D, 22:276-306, 
1986. 







18. D. J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415-447, 1992. 

19. John Moody. Prediction risk and architecture selection for neural networks. In 
V. Cherkassky, J. H. Friedman, and H. Wechsler, editors. From Statistics to Neu- 
ral Networks: Theory and Pattern Recognition Applications. Springer, NATO ASI 
Series F, 1994. 

20. S. J. Nowlan. Soft competitive adaptation: Neural network learning algorithms 
based on fitting statistical mixtures. Ph.D. dissertation, Carnegie Mellon University, 
1991. 

21. M. J. Orr, J. Hallman, K. Takezawa, A. Murray, S. Ninomiya, M. Oide, and 
T. Leonard. Combining regression trees and radial basis functions. Division of 
informatics, Edinburgh University, 1999. Submitted to IJNS. 

22. R. P. Gorman and T. J. Sejnowski. Analysis of hidden units in a layered network 
trained to classify sonar targets. Neural Networks, 1:75-89, 1988. 

23. A. J. Robinson. Dynamic Error Propagation Networks. PhD thesis, University of 
Cambridge, 1989. 

24. D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal represen- 
tations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors. 
Parallel Distributed Processing, volume 1, pages 318-362. MIT Press, Cambridge, 
MA, 1986. 

25. C. J. Stone. The dimensionality reduction principle for generalized additive models. 
The Annals of Statistics, 14:590-606, 1986. 




Author Index 



Adams, N.M., 136 
Alkoot, F.M., 339, 429 
Alonso Gonzalez, C.J., 43 

Benediktsson, J.A., 279 
Benfenati, E., 126 
Brambilla, R., 126 
Briem, G.J., 279 
Bruzzone, L., 259 
Bunke, H., 388 
Buxton, B.F., 11, 68 

Chen, D., 119 
Cohen, S., 440 
Cossu, R., 259 

Dahmen, J., 109 
Damper, R.I., 369 
Debeir, O., 178 
Decaestecker, C., 178 
Dietrich, G., 378 
Dmitriev, A.V., 289 
Dodd, T.J., 369 
Dolenko, S.A., 289 
Duin, R.P.W., 1, 299, 359 

Fairhurst, M.C., 99 
Foggia, P., 208 
Fred, A., 309 
Froba, B., 418 
Frossyniotis, D.S., 198 
Fumera, G., 329 
Furlanello, G., 32 

Gabrys, B., 399 
Ghaderi, R., 148 
Giacinto, G., 78 
Gini, G., 126 
Gonzalez, J.C., 22 
Grim, J., 168 

Hand, D., 136 
Hartono, P., 188 
Hashimoto, S., 188 
Higgins, J.E., 369 
Ho, T.K., 53 



Holden, S., 11 
Hoque, M.S., 99 

Intrator, N., 440 

Jain, A.K., 88 
Jprgensen, T.M., 218 

Kelly, M.G., 136 

Keysers, D., 109 

Kittler, J., 168, 248, 339, 429 

Kuncheva, L.I., 228, 349 

Langdon, W.B., 68 
Larcher, B., 32 
Latinne, P., 178 
Linneberg, G., 218 
Liu, J., 119 
Lorenzini, M., 126 
Luttrell, S.P., 319 

Malve, L., 126 
Marcialis, G.L., 349 
Marti, U.-V., 388 
Masulli, F., 158 
Merler, S., 32 

Ney, H., 109 

Orlov, Y.V., 289 
Oza, N.G., 238 

Palm, G., 378, 409 
Pękalska, E., 359 
Persiantsev, I.G., 289 
Prabhakar, S., 88 
Pudil, P., 168 

Rodriguez Diez, J.J., 43 
Roli, F., 78, 329, 349 
Ruta, D., 399 

Sansone, C., 208 
Sboner, A., 32 
Schwenker, F., 378, 409 
Shipp, C.A., 349 
Shugai, J.S., 289 







Sirlantzis, K., 99 
Skurichina, M., 1 
Smits, P.C., 269 
Somol, P., 168 
Stafylopatis, A., 198 
Suvorova, A.V., 289 
Sveinsson, J.R., 279 

Tapia, E., 22 
Tax, D.M.J., 299 
Tortorella, F., 208 
Turner, K., 238 



Valentini, G., 158 
Vento, M., 208 
Vernazza, G., 78 
Veselovsky, I.S., 289 
Villena, J., 22 

Whitaker, C.J., 228 
Wickramaratna, J., 11 
Windeatt, T., 148 
Windridge, D., 248 

Zink, W., 418 




